Skip to content
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
14 changes: 14 additions & 0 deletions datafusion/core/tests/fuzz_cases/pruning.rs
Original file line number Diff line number Diff line change
Expand Up @@ -110,13 +110,27 @@ async fn test_utf8_not_like_prefix() {
.await;
}

#[tokio::test]
async fn test_utf8_not_like_ecsape() {
Utf8Test::new(|value| col("a").not_like(lit(format!("\\%{}%", value))))
.run()
.await;
}
Comment on lines +113 to +118
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Love these fuzz tests! These 3 lines of code give me great confidence that this PR does the right thing 😄


#[tokio::test]
async fn test_utf8_not_like_suffix() {
Utf8Test::new(|value| col("a").not_like(lit(format!("{}%", value))))
.run()
.await;
}

#[tokio::test]
async fn test_utf8_not_like_suffix_one() {
Utf8Test::new(|value| col("a").not_like(lit(format!("{}_", value))))
.run()
.await;
}

/// Fuzz testing for UTF8 predicate pruning
/// The basic idea is that query results should always be the same with or without stats/pruning
/// If we get this right we at least guarantee that there are no incorrect results
Expand Down
227 changes: 214 additions & 13 deletions datafusion/physical-optimizer/src/pruning.rs
Original file line number Diff line number Diff line change
Expand Up @@ -1590,6 +1590,7 @@ fn build_statistics_expr(
)),
))
}
Operator::NotLikeMatch => build_not_like_match(expr_builder)?,
Operator::LikeMatch => build_like_match(expr_builder).ok_or_else(|| {
plan_datafusion_err!(
"LIKE expression with wildcard at the beginning is not supported"
Expand Down Expand Up @@ -1638,6 +1639,19 @@ fn build_statistics_expr(
Ok(statistics_expr)
}

/// returns the string literal of the scalar value if it is a string
fn unpack_string(s: &ScalarValue) -> Option<&str> {
s.try_as_str().flatten()
}

fn extract_string_literal(expr: &Arc<dyn PhysicalExpr>) -> Option<&str> {
if let Some(lit) = expr.as_any().downcast_ref::<phys_expr::Literal>() {
let s = unpack_string(lit.value())?;
return Some(s);
}
None
}

/// Convert `column LIKE literal` where P is a constant prefix of the literal
/// to a range check on the column: `P <= column && column < P'`, where P' is the
/// lowest string after all P* strings.
Expand All @@ -1650,19 +1664,6 @@ fn build_like_match(
// column LIKE '%foo%' => min <= '' && '' <= max => true
// column LIKE 'foo' => min <= 'foo' && 'foo' <= max

/// returns the string literal of the scalar value if it is a string
fn unpack_string(s: &ScalarValue) -> Option<&str> {
s.try_as_str().flatten()
}

fn extract_string_literal(expr: &Arc<dyn PhysicalExpr>) -> Option<&str> {
if let Some(lit) = expr.as_any().downcast_ref::<phys_expr::Literal>() {
let s = unpack_string(lit.value())?;
return Some(s);
}
None
}

// TODO Handle ILIKE perhaps by making the min lowercase and max uppercase
// this may involve building the physical expressions that call lower() and upper()
let min_column_expr = expr_builder.min_column_expr().ok()?;
Expand Down Expand Up @@ -1710,6 +1711,80 @@ fn build_like_match(
Some(combined)
}

// For predicate `col NOT LIKE 'const_prefix%'`, we rewrite it as `(col_min NOT LIKE 'const_prefix%' OR col_max NOT LIKE 'const_prefix%')`.
//
// The intuition is that if both `col_min` and `col_max` begin with `const_prefix` that means
// **all** data in this row group begins with `const_prefix` as well (and therefore the predicate
// looking for rows that don't begin with `const_prefix` can never be true)
fn build_not_like_match(
expr_builder: &mut PruningExpressionBuilder<'_>,
) -> Result<Arc<dyn PhysicalExpr>> {
// col NOT LIKE 'const_prefix%' -> !(col_min LIKE 'const_prefix%' && col_max LIKE 'const_prefix%') -> (col_min NOT LIKE 'const_prefix%' || col_max NOT LIKE 'const_prefix%')
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is not true either

the truth is (or should be):

col NOT LIKE 'const_prefix%' ->
   col_max < 'const_prefix' OR col_min >= 'const_prefiy'
   (notice the last character of the rightmost constant is different)

Copy link
Contributor Author

@UBarney UBarney Feb 12, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is not true either

Why? Does it prune row groups that contain data that does not match the pattern? Or is it just inefficient?

First, let's clarify one point: if a row group contains any data that does not match the pattern, then this row group must not be pruned. The expected behavior is to return all data that does not match. If the row group gets pruned, data loss will occur (see this PR).

For `col NOT LIKE 'const_prefix%'`:  
It applies the condition:  
`col_max < 'const_prefix' OR col_min >= 'const_prefiy'`  
(Note that the last character of the rightmost constant is different.)  

I think this mistakenly prunes row groups that contain data not matching the pattern.

Consider this case: col NOT LIKE 'const_prefix%'. The row group might include values such as ["aaa", "b", "const_prefix"]. The condition col_max < 'const_prefix' OR col_min >= 'const_prefiy' evaluates to false, causing the row group to be pruned. However, this is incorrect because "aaa" does not match the pattern and should be included in the result.


let min_column_expr = expr_builder.min_column_expr()?;
let max_column_expr = expr_builder.max_column_expr()?;

let scalar_expr = expr_builder.scalar_expr();

let pattern = extract_string_literal(scalar_expr).ok_or_else(|| {
plan_datafusion_err!("cannot extract literal from NOT LIKE expression")
})?;

let (const_prefix, remaining) = split_constant_prefix(pattern);
if const_prefix.is_empty() || remaining != "%" {
// we can not handle `%` at the beginning or in the middle of the pattern
// Example: For pattern "foo%bar", the row group might include values like
// ["foobar", "food", "foodbar"], making it unsafe to prune.
// Even if the min/max values in the group (e.g., "foobar" and "foodbar")
// match the pattern, intermediate values like "food" may not
// match the full pattern "foo%bar", making pruning unsafe.
// (truncate foo%bar to foo% have same problem)

// we can not handle pattern containing `_`
// Example: For pattern "foo_", row groups might contain ["fooa", "fooaa", "foob"],
// which means not every row is guaranteed to match the pattern.
Comment on lines +1735 to +1745
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a good explanation for the code we didn't write (or we did, but it was deleted).
This can be written more teresly as

Suggested change
// we can not handle `%` at the beginning or in the middle of the pattern
// Example: For pattern "foo%bar", the row group might include values like
// ["foobar", "food", "foodbar"], making it unsafe to prune.
// Even if the min/max values in the group (e.g., "foobar" and "foodbar")
// match the pattern, intermediate values like "food" may not
// match the full pattern "foo%bar", making pruning unsafe.
// (truncate foo%bar to foo% have same problem)
// we can not handle pattern containing `_`
// Example: For pattern "foo_", row groups might contain ["fooa", "fooaa", "foob"],
// which means not every row is guaranteed to match the pattern.
// We handle `NOT LIKE '<constant-prefix>%'` case only, because only this case
// can be converted to range predicates that can be then compared with row group min/max values.

return Err(plan_datafusion_err!(
"NOT LIKE expressions only support constant_prefix+wildcard`%`"
));
}

let min_col_not_like_epxr = Arc::new(phys_expr::LikeExpr::new(
true,
false,
Arc::clone(&min_column_expr),
Arc::clone(scalar_expr),
Comment on lines +1751 to +1755
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should be

col_min >= 'const_prefiy'

(and renamed)

));

let max_col_not_like_expr = Arc::new(phys_expr::LikeExpr::new(
true,
false,
Arc::clone(&max_column_expr),
Arc::clone(scalar_expr),
));
Comment on lines +1758 to +1763
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should be

col_max < 'const_prefix'


Ok(Arc::new(phys_expr::BinaryExpr::new(
min_col_not_like_epxr,
Operator::Or,
max_col_not_like_expr,
)))
}

/// Returns unescaped constant prefix of a LIKE pattern (possibly empty) and the remaining pattern (possibly empty)
fn split_constant_prefix(pattern: &str) -> (&str, &str) {
let char_indices = pattern.char_indices().collect::<Vec<_>>();
for i in 0..char_indices.len() {
let (idx, char) = char_indices[i];
if char == '%' || char == '_' {
if i != 0 && char_indices[i - 1].1 == '\\' {
// ecsaped by `\`
continue;
}
return (&pattern[..idx], &pattern[idx..]);
}
}
(pattern, "")
}

/// Increment a UTF8 string by one, returning `None` if it can't be incremented.
/// This makes it so that the returned string will always compare greater than the input string
/// or any other string with the same prefix.
Expand Down Expand Up @@ -4061,6 +4136,132 @@ mod tests {
prune_with_expr(expr, &schema, &statistics, expected_ret);
}

#[test]
fn prune_utf8_not_like_one() {
let (schema, statistics) = utf8_setup();

let expr = col("s1").not_like(lit("A\u{10ffff}_"));
#[rustfmt::skip]
let expected_ret = &[
// s1 ["A", "Z"] ==> some rows could pass (must keep)
true,
// s1 ["A", "L"] ==> some rows could pass (must keep)
true,
// s1 ["N", "Z"] ==> some rows could pass (must keep)
true,
// s1 ["M", "M"] ==> some rows could pass (must keep)
true,
// s1 [NULL, NULL] ==> unknown (must keep)
true,
// s1 ["A", NULL] ==> some rows could pass (must keep)
true,
// s1 ["", "A"] ==> some rows could pass (must keep)
true,
// s1 ["", ""] ==> some rows could pass (must keep)
true,
// s1 ["AB", "A\u{10ffff}\u{10ffff}\u{10ffff}"] ==> some rows could pass (must keep)
true,
// s1 ["A\u{10ffff}\u{10ffff}", "A\u{10ffff}\u{10ffff}"] ==> no row match. (min, max) maybe truncate
// orignal (min, max) maybe ("A\u{10ffff}\u{10ffff}\u{10ffff}", "A\u{10ffff}\u{10ffff}\u{10ffff}\u{10ffff}")
true,
];
prune_with_expr(expr, &schema, &statistics, expected_ret);
}

#[test]
fn prune_utf8_not_like_many() {
let (schema, statistics) = utf8_setup();

let expr = col("s1").not_like(lit("A\u{10ffff}%"));
#[rustfmt::skip]
let expected_ret = &[
// s1 ["A", "Z"] ==> some rows could pass (must keep)
true,
// s1 ["A", "L"] ==> some rows could pass (must keep)
true,
// s1 ["N", "Z"] ==> some rows could pass (must keep)
true,
// s1 ["M", "M"] ==> some rows could pass (must keep)
true,
// s1 [NULL, NULL] ==> unknown (must keep)
true,
// s1 ["A", NULL] ==> some rows could pass (must keep)
true,
// s1 ["", "A"] ==> some rows could pass (must keep)
true,
// s1 ["", ""] ==> some rows could pass (must keep)
true,
// s1 ["AB", "A\u{10ffff}\u{10ffff}\u{10ffff}"] ==> some rows could pass (must keep)
true,
// s1 ["A\u{10ffff}\u{10ffff}", "A\u{10ffff}\u{10ffff}"] ==> no row match
false,
];
prune_with_expr(expr, &schema, &statistics, expected_ret);

let expr = col("s1").not_like(lit("A\u{10ffff}%\u{10ffff}"));
#[rustfmt::skip]
let expected_ret = &[
// s1 ["A", "Z"] ==> some rows could pass (must keep)
true,
// s1 ["A", "L"] ==> some rows could pass (must keep)
true,
// s1 ["N", "Z"] ==> some rows could pass (must keep)
true,
// s1 ["M", "M"] ==> some rows could pass (must keep)
true,
// s1 [NULL, NULL] ==> unknown (must keep)
true,
// s1 ["A", NULL] ==> some rows could pass (must keep)
true,
// s1 ["", "A"] ==> some rows could pass (must keep)
true,
// s1 ["", ""] ==> some rows could pass (must keep)
true,
// s1 ["AB", "A\u{10ffff}\u{10ffff}\u{10ffff}"] ==> some rows could pass (must keep)
true,
// s1 ["A\u{10ffff}\u{10ffff}", "A\u{10ffff}\u{10ffff}"] ==> some rows could pass (must keep)
true,
];
prune_with_expr(expr, &schema, &statistics, expected_ret);

let expr = col("s1").not_like(lit("A\u{10ffff}%\u{10ffff}_"));
#[rustfmt::skip]
let expected_ret = &[
// s1 ["A", "Z"] ==> some rows could pass (must keep)
true,
// s1 ["A", "L"] ==> some rows could pass (must keep)
true,
// s1 ["N", "Z"] ==> some rows could pass (must keep)
true,
// s1 ["M", "M"] ==> some rows could pass (must keep)
true,
// s1 [NULL, NULL] ==> unknown (must keep)
true,
// s1 ["A", NULL] ==> some rows could pass (must keep)
true,
// s1 ["", "A"] ==> some rows could pass (must keep)
true,
// s1 ["", ""] ==> some rows could pass (must keep)
true,
// s1 ["AB", "A\u{10ffff}\u{10ffff}\u{10ffff}"] ==> some rows could pass (must keep)
true,
// s1 ["A\u{10ffff}\u{10ffff}", "A\u{10ffff}\u{10ffff}"] ==> some rows could pass (must keep)
true,
];
prune_with_expr(expr, &schema, &statistics, expected_ret);

let expr = col("s1").not_like(lit("A\\%%"));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you please add some negative tests that verify predicates like NOT LIKE 'foo%bar' and NOT LIKE 'foo%bar_ (aka non empty additional patterns) can not be used to prune anything?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

add not like A\u{10ffff}%\u{10ffff}_. I think existing testcase col("s1").not_like(lit("A\u{10ffff}%\u{10ffff}")) is equivalent to NOT LIKE 'foo%bar'

let statistics = TestStatistics::new().with(
"s1",
ContainerStats::new_utf8(
vec![Some("A%a"), Some("A")],
vec![Some("A%c"), Some("A")],
),
);
let expected_ret = &[false, true];
prune_with_expr(expr, &schema, &statistics, expected_ret);
}

#[test]
fn test_rewrite_expr_to_prunable() {
let schema = Schema::new(vec![Field::new("a", DataType::Int32, true)]);
Expand Down
Loading