Summary
arrays_overlap is marked as Incompatible in Comet, but the specific incompatibility is not documented. This issue tracks documenting and potentially fixing the behavior difference.
Spark Specification
Spark's arrays_overlap behaves as follows:
- Returns true if at least one non-null element exists in both arrays
- Returns false if no common elements are found AND no null elements exist (or one of the arrays is empty)
- Returns null if no common elements are found BUT null elements exist in either array and both arrays are non-empty (three-valued logic)
Examples:
SELECT arrays_overlap(array(1, 2, 3), array(3, 4, 5));
-- Spark returns: true
SELECT arrays_overlap(array(1, 2), array(3, 4));
-- Spark returns: false
SELECT arrays_overlap(array(1, null, 3), array(4, 5));
-- Spark returns: null (because null element exists, result is indeterminate)
SELECT arrays_overlap(array(1, null, 3), array(1, 4));
-- Spark returns: true (found common element 1)
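The rules above can be sketched as a small reference model, which may help when comparing Spark and Comet results case by case. This is a minimal sketch, not Spark's or Comet's actual implementation: the function name spark_arrays_overlap is hypothetical, None stands in for SQL NULL, and the non-empty condition follows Spark's function documentation. A NULL array argument (as opposed to a NULL element) is not modeled here.

```rust
/// Reference model of the arrays_overlap rules described above.
/// `None` models SQL NULL, both for array elements and for the result.
/// (Hypothetical helper for illustration; not part of Spark, Comet, or DataFusion.)
fn spark_arrays_overlap<T: PartialEq>(a: &[Option<T>], b: &[Option<T>]) -> Option<bool> {
    // A shared non-null element decides the result, regardless of any nulls.
    let has_common = a
        .iter()
        .flatten()
        .any(|x| b.iter().flatten().any(|y| x == y));
    if has_common {
        return Some(true);
    }

    // No common element: the answer is unknown only if a null element could
    // have matched something, which requires both arrays to be non-empty.
    let any_null = a.iter().chain(b.iter()).any(|e| e.is_none());
    if any_null && !a.is_empty() && !b.is_empty() {
        None
    } else {
        Some(false)
    }
}
```

Under this model, the four SQL examples above evaluate to Some(true), Some(false), None, and Some(true) respectively.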
Current Comet Behavior
Comet uses DataFusion's array_has_any function. The specific null handling behavior may differ:
- DataFusion may return false instead of null when no overlap is found but nulls exist
Current Tests
Looking at CometArrayExpressionSuite.scala:
checkSparkAnswerAndOperator(sql(
"SELECT arrays_overlap(array('a', null), array('b', null)) from t1 where _1 is not null"))
Tests exist but the expression is marked as Incompatible, requiring allow_incompatible=true to run.
Possible Solutions
- Verify actual behavior difference - run specific test cases comparing Spark vs Comet
- Custom Rust implementation if DataFusion doesn't match Spark's three-valued null logic
- Post-processing - wrap result to check for null elements and convert false to null
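The post-processing option could look roughly like the sketch below. It assumes the underlying overlap check never returns null and ignores null elements when matching. That assumption is unverified for DataFusion's array_has_any, and both function names here are hypothetical stand-ins rather than real Comet or DataFusion APIs.

```rust
/// Stand-in for a null-unaware overlap check (assumed behavior, not DataFusion's
/// actual array_has_any): compares only non-null elements and never returns NULL.
fn null_unaware_overlap<T: PartialEq>(a: &[Option<T>], b: &[Option<T>]) -> bool {
    a.iter()
        .flatten()
        .any(|x| b.iter().flatten().any(|y| x == y))
}

/// Hypothetical post-processing step: keep `true` as is, and turn `false`
/// into NULL (`None`) when a null element makes the answer indeterminate.
fn to_spark_semantics<T>(raw: bool, a: &[Option<T>], b: &[Option<T>]) -> Option<bool> {
    if raw {
        return Some(true);
    }
    let any_null = a.iter().chain(b.iter()).any(|e| e.is_none());
    // Spark's documentation only returns NULL when both arrays are non-empty.
    if any_null && !a.is_empty() && !b.is_empty() {
        None
    } else {
        Some(false)
    }
}
```

One caveat with this design: if the underlying function treats two null elements as equal, it would return true for a case like array('a', null) vs array('b', null), and a wrapper that only rewrites false results would not correct that. Verifying the actual behavior first (the first option above) determines whether this wrapper is sufficient.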
Note: This issue was generated with AI assistance.