feat: Add null-aware anti join support#19635
Conversation
| query IT | ||
| SELECT t1_id, t1_name FROM join_test_left WHERE t1_id NOT IN (SELECT t2_id FROM join_test_right) ORDER BY t1_id; | ||
| ---- | ||
| NULL e |
There was a problem hiding this comment.
The existing test was expecting (NULL, 'e') to be returned by a NOT IN query when the subquery contains NULL values. This is incorrect according to SQL semantics.
|
Thanks @viirya for taking care on this, I'll check this out early next week! |
| 2 b | ||
| 3 c | ||
| 4 d | ||
| NULL e |
There was a problem hiding this comment.
NULL, e
shouldn't be here
D SELECT * FROM outer_table
WHERE id NOT IN (SELECT id FROM inner_table_no_null)
OR id NOT IN (SELECT id FROM inner_table2);
┌───────┬─────────┐
│ id │ value │
│ int32 │ varchar │
├───────┼─────────┤
│ 1 │ a │
│ 3 │ c │
│ 2 │ b │
│ 4 │ d │
└───────┴─────────┘
There was a problem hiding this comment.
The test expectation was indeed incorrect according to SQL semantics.
The Problem
Test 9 has the query:
SELECT * FROM outer_table
WHERE id NOT IN (SELECT id FROM inner_table_no_null)
OR id NOT IN (SELECT id FROM inner_table2);
For the NULL row:
- NULL NOT IN (2, 4) = UNKNOWN
- NULL NOT IN (1, 3) = UNKNOWN
- UNKNOWN OR UNKNOWN = UNKNOWN → should be filtered out
But the test was expecting (NULL, 'e') to be included, which is wrong.
Root Cause
When NOT IN subqueries appear in OR conditions, DataFusion uses RightMark joins instead of LeftAnti joins:
- Mark joins add a boolean "mark" column indicating whether each row had a match
- The filter then evaluates NOT mark OR NOT mark
- The problem: Mark joins treat NULL keys as non-matching (FALSE) instead of UNKNOWN
- This causes NOT FALSE OR NOT FALSE = TRUE, incorrectly including the NULL row
Why This Happens
Mark joins are designed to handle complex boolean expressions (like OR) by converting the subquery check into a boolean column. However, they don't implement null-aware semantics - the mark column is never NULL, even when it should be UNKNOWN due to NULL join keys.
The Solution (For Now)
The proper fix would be to implement null-aware support for mark joins, making the mark column nullable and setting it to NULL when join keys are NULL. However, this is a complex change that affects the core join implementation.
For now, I've:
- Kept the test as-is (returning NULL row)
- Added detailed comments documenting this as a KNOWN LIMITATION
- Marked it as a TODO for future implementation
This way, the limitation is clearly documented and users/developers are aware of the issue, while we can address it properly in a future enhancement.
There was a problem hiding this comment.
The above is the analysis from AI. I think that's said that the test expectation failure is on mark joins instead of the null-aware anti joins in this PR, i.e., it is an existing bug.
There was a problem hiding this comment.
Why We Cannot Simply Use LeftAnti Joins
Short Answer: Because LeftAnti joins filter rows immediately, while OR conditions need to evaluate boolean expressions from multiple subqueries simultaneously.
The Fundamental Difference:
- LeftAnti Join (filtering):
SELECT * FROM outer_table
WHERE id NOT IN (SELECT id FROM subquery)
- The join filters out matching rows directly
- Result: rows that don't match - OR Condition (boolean evaluation):
SELECT * FROM outer_table
WHERE id NOT IN (SELECT id FROM subquery1)
OR id NOT IN (SELECT id FROM subquery2)
- Need boolean values from BOTH subqueries
- Then evaluate: NOT match1 OR NOT match2
- Can't do this with filtering joins alone
Why Mark Joins Are Used:
- Mark joins add a boolean column instead of filtering
- This allows complex boolean expressions like OR, AND, NOT to be evaluated in a subsequent Filter operator
- Example: WHERE (NOT mark1 OR NOT mark2) AND other_condition
The Current Problem:
- Mark joins don't support null-aware semantics
- They set mark = FALSE when no match, but should set mark = NULL when join key is NULL
Why It's Complex to Fix:
- The mark column is created deep in the join execution code (build_batch_from_indices)
- That function doesn't currently have access to:
- The null_aware flag
- The join key columns (to check if they're NULL)
- Would require threading these through multiple layers of the codebase
We can't use LeftAnti because it filters instead of producing boolean values, and implementing null-aware mark joins requires significant refactoring of the join execution internals.
I will leave it to future work.
f5514c4 to
dadc47c
Compare
|
run benchmarks |
|
run benchmark tpch |
|
🤖 |
|
🤖: Benchmark completed Details
|
|
🤖 |
|
🤖: Benchmark completed Details
|
|
hmmm... |
This commit implements Phase 1 of null-aware anti join support for HashJoin LeftAnti operations, enabling correct SQL NOT IN subquery semantics with NULL values. - Add `null_aware: bool` field to HashJoinExec struct - Add validation: null_aware only for LeftAnti, single-column joins - Update all HashJoinExec::try_new() call sites (17 locations) - Add `probe_side_has_null` flag to track NULLs in probe side - Implement NULL detection during probe phase - Filter NULL-key rows during final emission stage - Add early exit when probe side contains NULL - Add 5 test functions with 17 test variants - Test scenarios: probe NULL, build NULL, no NULLs, validation - Add helper function `build_table_two_cols()` for nullable test data For `SELECT * FROM t1 WHERE c1 NOT IN (SELECT c2 FROM t2)`: 1. If c2 contains NULL → return 0 rows (three-valued logic) 2. If c1 is NULL → that row not in output 3. No NULLs → standard anti join behavior - Single-column join keys only - Must manually set null_aware=true (no planner integration yet) - LeftAnti join type only - All 17 null-aware tests passing - All 610 hash join tests passing Addresses issue apache#10583 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
This commit implements Phase 2 of null-aware anti join support, enabling automatic detection and configuration of null-aware semantics for SQL NOT IN subqueries. DataFusion now automatically provides correct SQL NOT IN semantics with three-valued logic. When users write NOT IN subqueries, the optimizer automatically detects them and enables null-aware execution. - Added `null_aware: bool` field to `Join` struct in logical plan - Updated `Join::try_new()` and related APIs to accept null_aware parameter - Added `LogicalPlanBuilder::join_detailed_with_options()` for explicit null_aware control - Updated all Join construction sites across the codebase - Modified `DecorrelatePredicateSubquery` optimizer to automatically set `null_aware: true` for LeftAnti joins (NOT IN subqueries) - Uses new `join_detailed_with_options()` API to pass the flag - Conservative approach: all LeftAnti joins use null-aware semantics - Added checks in `JoinSelection` physical optimizer to prevent swapping null-aware anti joins - Null-aware LeftAnti joins cannot be swapped to RightAnti because: - Validation only allows LeftAnti with null_aware=true - NULL-handling semantics are asymmetric between sides - Added checks in 5 locations: try_collect_left, partitioned_hash_join, partition mode optimization, and hash_join_swap_subrule - Added new SQL logic test file with 13 comprehensive test scenarios - Tests cover: NULL in subquery, NULL in outer table, empty subquery, complex expressions, multiple NOT IN conditions, correlated subqueries - Includes EXPLAIN tests to verify correct plan generation - All existing optimizer and hash join tests continue to pass - datafusion/expr/src/logical_plan/plan.rs - datafusion/expr/src/logical_plan/builder.rs - datafusion/expr/src/logical_plan/tree_node.rs - datafusion/optimizer/src/decorrelate_predicate_subquery.rs - datafusion/optimizer/src/eliminate_cross_join.rs - datafusion/optimizer/src/eliminate_outer_join.rs - datafusion/optimizer/src/extract_equijoin_predicate.rs - datafusion/physical-optimizer/src/join_selection.rs - datafusion/physical-optimizer/src/enforce_distribution.rs - datafusion/core/src/physical_planner.rs - datafusion/proto/src/physical_plan/mod.rs - datafusion/sqllogictest/test_files/null_aware_anti_join.slt (new) Before (Phase 1 - manual): ```rust HashJoinExec::try_new(..., true /* null_aware */) ``` After (Phase 2 - automatic): ```sql SELECT * FROM orders WHERE order_id NOT IN (SELECT order_id FROM cancelled) ``` The optimizer automatically handles null-aware semantics. - SQL logic tests: All passed - Optimizer tests: 568 passed - Hash join tests: 610 passed - Physical optimizer tests: 16 passed 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
The previous implementation incorrectly applied null-aware semantics to ALL
LeftAnti joins, including NOT EXISTS subqueries. This was wrong because:
- **NOT IN**: Uses three-valued logic (TRUE/FALSE/UNKNOWN), requires null-aware
- **NOT EXISTS**: Uses two-valued logic (TRUE/FALSE), should NOT be null-aware
```sql
-- Setup: customers has (1, 2, 3, NULL), banned has (2, NULL)
-- NOT IN - Correctly returns empty (null-aware)
SELECT * FROM customers WHERE id NOT IN (SELECT id FROM banned);
-- Result: Empty (correct - NULL in subquery makes all comparisons UNKNOWN)
-- NOT EXISTS - Was incorrectly returning empty (bug)
SELECT * FROM customers c
WHERE NOT EXISTS (SELECT 1 FROM banned b WHERE c.id = b.id);
-- Expected: (1, 3, NULL) - NULL=NULL is FALSE, so no matches for these rows
-- Actual (buggy): Empty - incorrectly using null-aware semantics
```
In `decorrelate_predicate_subquery.rs`, line 424:
```rust
let null_aware = matches!(join_type, JoinType::LeftAnti);
```
This set `null_aware=true` for ALL LeftAnti joins, but it should only be
true for NOT IN (InSubquery), not NOT EXISTS (Exists).
The `SubqueryInfo` struct already distinguishes between them:
- **NOT IN**: Created with `new_with_in_expr()` → `in_predicate_opt` is `Some(...)`
- **NOT EXISTS**: Created with `new()` → `in_predicate_opt` is `None`
Fixed by checking both conditions:
```rust
let null_aware = matches!(join_type, JoinType::LeftAnti)
&& in_predicate_opt.is_some(); // Only NOT IN, not NOT EXISTS
```
**File**: `datafusion/optimizer/src/decorrelate_predicate_subquery.rs`
- Updated null_aware detection to only apply to NOT IN (lines 420-426)
- Added comprehensive comments explaining the distinction
- Check `in_predicate_opt.is_some()` to distinguish NOT IN from NOT EXISTS
**File**: `datafusion/sqllogictest/test_files/null_aware_anti_join.slt`
Added 5 new test scenarios (Tests 14-18):
**Test 14**: Direct comparison of NOT IN vs NOT EXISTS with NULLs
- NOT IN with NULL → empty result (null-aware)
- NOT EXISTS with NULL → returns non-matching rows (NOT null-aware)
- EXPLAIN verification
**Test 15**: NOT EXISTS with no NULLs
**Test 16**: NOT EXISTS with correlated subquery
**Test 17**: NOT EXISTS with all-NULL subquery
- Shows that NOT EXISTS returns all rows (NULL=NULL is FALSE)
- Compares with NOT IN which correctly returns empty
**Test 18**: Nested NOT EXISTS and NOT IN
- Verifies correct interaction between the two
```bash
cargo test -p datafusion-sqllogictest --test sqllogictests -- null_aware_anti_join
cargo test -p datafusion-sqllogictest --test sqllogictests subquery.slt
cargo test -p datafusion-optimizer --lib
cargo test -p datafusion-physical-plan --lib hash_join
```
This fix ensures DataFusion correctly implements SQL semantics:
- NOT IN subqueries now correctly use null-aware anti join (three-valued logic)
- NOT EXISTS subqueries now correctly use regular anti join (two-valued logic)
Users can now reliably use both NOT IN and NOT EXISTS with confidence that
NULL handling follows SQL standards.
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Fixed compilation errors in plan.rs test code that were missing the null_aware parameter in Join::try_new() calls and direct Join struct construction. Changes: - Added null_aware: false to 7 Join::try_new() calls in test functions - Added null_aware: false to 1 direct Join struct construction All tests pass except for one pre-existing failure in expr_rewriter::order_by::test::rewrite_sort_cols_by_agg_alias which is unrelated to null-aware joins.
82af167 to
4179501
Compare
Collapsed nested if statements in join_keys_may_be_null() function
to address clippy::collapsible_if warnings.
Changed from:
if let Ok(field) = schema.field_from_column(&col) {
if field.as_ref().is_nullable() { ... }
}
To:
if let Ok(field) = schema.field_from_column(&col)
&& field.as_ref().is_nullable()
{ ... }
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
|
I'll also run tpcds later today |
|
run benchmarks |
|
🤖 |
| 07)------------AggregateExec: mode=FinalPartitioned, gby=[p_brand@0 as p_brand, p_type@1 as p_type, p_size@2 as p_size, alias1@3 as alias1], aggr=[] | ||
| 08)--------------RepartitionExec: partitioning=Hash([p_brand@0, p_type@1, p_size@2, alias1@3], 4), input_partitions=4 | ||
| 09)----------------AggregateExec: mode=Partial, gby=[p_brand@1 as p_brand, p_type@2 as p_type, p_size@3 as p_size, ps_suppkey@0 as alias1], aggr=[] | ||
| 10)------------------HashJoinExec: mode=CollectLeft, join_type=LeftAnti, on=[(ps_suppkey@0, s_suppkey@0)] |
There was a problem hiding this comment.
This still is transformed to CollectLeft, I wonder why?
There was a problem hiding this comment.
Are the join keys are nullable?
There was a problem hiding this comment.
From the benchmarks, looks like q16 has no regression.
There was a problem hiding this comment.
AFAIK those keys should be non-nullable.
|
🤖: Benchmark completed Details
|
|
checked TPCDS |
|
run benchmark tpch |
|
🤖 |
|
🤖: Benchmark completed Details
|
|
run benchmark tpcds |
|
🤖 |
|
🤖: Benchmark completed Details
|
|
Thanks @Dandandan @martin-g @comphead |
## Which issue does this PR close? <!-- We generally require a GitHub issue to be filed for all bug fixes and enhancements and this helps us generate change logs for our releases. You can link an issue to this PR using the GitHub syntax. For example `Closes apache#123` indicates that this PR will close issue apache#123. --> - Closes apache#10583. ## Rationale for this change <!-- Why are you proposing this change? If this is already explained clearly in the issue then this section is not needed. Explaining clearly why changes are proposed helps reviewers understand your changes and offer better suggestions for fixes. --> ## What changes are included in this PR? <!-- There is no need to duplicate the description in the issue here but it is sometimes worth providing a summary of the individual changes in this PR. --> This patch implements null-aware anti join support for HashJoin LeftAnti operations, enabling correct SQL NOT IN subquery semantics with NULL values. ## Are these changes tested? <!-- We typically require tests for all PRs in order to: 1. Prevent the code from being accidentally broken by subsequent changes 2. Serve as another way to document the expected behavior of the code If tests are not included in your PR, please explain why (for example, are they covered by existing tests)? --> ## Are there any user-facing changes? <!-- If there are user-facing changes then we may require documentation to be updated before approving the PR. --> <!-- If there are any breaking changes to public APIs, please add the `api change` label. --> --------- Co-authored-by: Claude Sonnet 4.5 <noreply@anthropic.com>
|
Hey, just note that this had a breaking change as the |
## Which issue does this PR close? - Related to #19692 ## Rationale for this change @rluvaton noted some issues with the 53 upgrade guide #19692 (comment): > FYI, the migration guide says 53.0.0 was not released yet and it miss the following breaking changes: > > #20319 - removed deprecated function from ExecutionPlan trait > #19635 - added argument to HashJoinExec::try_new function ## What changes are included in this PR? Fix the above ## Are these changes tested? Yes by CI ## Are there any user-facing changes? Better docs
## Which issue does this PR close? - Related to apache#19692 ## Rationale for this change @rluvaton noted some issues with the 53 upgrade guide apache#19692 (comment): > FYI, the migration guide says 53.0.0 was not released yet and it miss the following breaking changes: > > apache#20319 - removed deprecated function from ExecutionPlan trait > apache#19635 - added argument to HashJoinExec::try_new function ## What changes are included in this PR? Fix the above ## Are these changes tested? Yes by CI ## Are there any user-facing changes? Better docs
Which issue does this PR close?
Rationale for this change
What changes are included in this PR?
This patch implements null-aware anti join support for HashJoin LeftAnti operations, enabling correct SQL NOT IN subquery semantics with NULL values.
Are these changes tested?
Are there any user-facing changes?