Description
Describe the bug
When evaluating an AND expression, if fewer than 20% of the LHS values are true, the evaluator uses a "pre-selection" strategy: it evaluates the right-hand side only for those rows where the left-hand side is true.
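The strategy can be sketched in plain Rust as below. This is a simplified model and not DataFusion's actual code: it uses `&[bool]` slices instead of Arrow arrays, ignores nulls, and the names `PRE_SELECTION_THRESHOLD` and `and_with_preselection` are hypothetical. The 20% threshold is taken from the description above.

```rust
/// Selectivity below which the RHS is evaluated only on selected rows.
/// The value follows the description above; the name is hypothetical.
const PRE_SELECTION_THRESHOLD: f64 = 0.2;

/// Evaluate `lhs AND rhs(row)` for every row. `rhs` stands in for the
/// right-hand expression and is called once per row it is evaluated on.
fn and_with_preselection(lhs: &[bool], rhs: impl Fn(usize) -> bool) -> Vec<bool> {
    let true_count = lhs.iter().filter(|&&l| l).count();
    let sparse = (true_count as f64) < PRE_SELECTION_THRESHOLD * lhs.len() as f64;

    if !sparse {
        // Regular path: evaluate the RHS on every row, then combine.
        return lhs
            .iter()
            .enumerate()
            .map(|(i, &l)| {
                let r = rhs(i);
                l && r
            })
            .collect();
    }

    // Pre-selection path: collect the rows where the LHS is true...
    let selected: Vec<usize> = lhs
        .iter()
        .enumerate()
        .filter_map(|(i, &l)| l.then_some(i))
        .collect();

    // ...evaluate the RHS only on those rows...
    let rhs_selected: Vec<bool> = selected.iter().map(|&i| rhs(i)).collect();

    // ...and scatter the results back; unselected rows stay false.
    let mut out = vec![false; lhs.len()];
    for (&row, &r) in selected.iter().zip(&rhs_selected) {
        out[row] = r;
    }
    out
}
```

The scatter step is what matters for this issue: whatever the RHS produces for the selected rows has to be mapped back onto the full-length result.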
However, there is a bug in how a scalar right-hand side is handled:
datafusion/datafusion/physical-expr/src/expressions/binary.rs
Lines 979 to 982 in 9deec2a
let right_boolean_array = match &right_result {
    ColumnarValue::Array(array) => array.as_boolean(),
    ColumnarValue::Scalar(_) => return Ok(right_result),
};
For any scalar RHS we just return the scalar, which is not generally correct.
For example, in the expression (x = 5 AND true) evaluated against [3, 5, 10, 12, 1], this returns the scalar true, which is equivalent to [true, true, true, true, true], whereas the correct result is [false, true, false, false, false].
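To make the difference concrete, here is a small self-contained model of the scalar case. It is a sketch under assumptions, not the DataFusion implementation: `Value` is a hypothetical stand-in for `ColumnarValue`, nulls are ignored, and `finish_fixed` shows one possible way to handle a scalar (expand it over the selected rows before scattering), which is not necessarily the fix adopted upstream.

```rust
/// Hypothetical stand-in for a columnar value: one bool per row, or a scalar.
#[derive(Debug, Clone, PartialEq)]
enum Value {
    Array(Vec<bool>),
    Scalar(bool),
}

/// Map RHS results for the selected rows back onto the full-length output.
/// Rows where the LHS is false stay false.
fn scatter(lhs: &[bool], rhs_selected: &[bool]) -> Vec<bool> {
    let mut rhs = rhs_selected.iter().copied();
    lhs.iter()
        .map(|&l| if l { rhs.next().unwrap_or(false) } else { false })
        .collect()
}

/// Mirrors the behavior described in this issue: a scalar RHS is
/// returned as-is, so the LHS mask is lost.
fn finish_buggy(lhs: &[bool], rhs: Value) -> Value {
    match rhs {
        Value::Array(sel) => Value::Array(scatter(lhs, &sel)),
        Value::Scalar(b) => Value::Scalar(b),
    }
}

/// One possible fix (an assumption): expand the scalar to one value per
/// selected row, then scatter exactly like the array case.
fn finish_fixed(lhs: &[bool], rhs: Value) -> Value {
    let selected = lhs.iter().filter(|&&l| l).count();
    let sel = match rhs {
        Value::Array(sel) => sel,
        Value::Scalar(b) => vec![b; selected],
    };
    Value::Array(scatter(lhs, &sel))
}
```

With the LHS mask for `x = 5` over [3, 5, 10, 12, 1], that is `[false, true, false, false, false]`, and a RHS of `Value::Scalar(true)`, `finish_buggy` yields the scalar while `finish_fixed` yields the LHS mask.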
To Reproduce
This is somewhat hard to reproduce, because expressions that trigger the issue will generally be optimized away. However, it can be triggered by unoptimized expressions or by expressions that are constructed manually.
This test exhibits the behavior:
#[test]
fn test_and_true_preselection_returns_lhs() {
    let schema =
        Arc::new(Schema::new(vec![Field::new("c", DataType::Boolean, false)]));
    let c_array = Arc::new(BooleanArray::from(vec![false, true, false, false, false]))
        as ArrayRef;
    let batch = RecordBatch::try_new(Arc::clone(&schema), vec![Arc::clone(&c_array)])
        .unwrap();
    let expr = logical2physical(&logical_col("c").and(expr_lit(true)), &schema);
    let result = expr.evaluate(&batch).unwrap();
    let ColumnarValue::Array(result_arr) = result else {
        panic!("Expected ColumnarValue::Array");
    };
    let expected: Vec<_> = c_array.as_boolean().iter().collect();
    let actual: Vec<_> = result_arr.as_boolean().iter().collect();
    assert_eq!(
        expected, actual,
        "AND with TRUE must equal LHS even with PreSelection"
    );
}
Expected behavior
No response
Additional context
No response