Skip to content

Remove / simplify "impossible" predicates #8687

@alamb

Description

@alamb

Is your feature request related to a problem or challenge?

@my-vegetable-has-exploded and I had an observation on #8654 (comment) where there are certain patterns that can never be true.

For example, WHERE x IN (1,2,3) AND x IN (4,5) can never be true because the IN lists are disjoint. However, DataFusion can't prove this today and will evaluate that expression.

So for example, given

DataFusion CLI v34.0.0
❯ create table t(x int) as values (1), (2), (3);
0 rows in set. Query took 0.003 seconds.

This query:

❯ explain select x from t where x IN (1,2,3) AND x IN (4,5);
+---------------+-----------------------------------------------------------------------------------------------------+
| plan_type     | plan                                                                                                |
+---------------+-----------------------------------------------------------------------------------------------------+
| logical_plan  | Filter: (t.x = Int32(1) OR t.x = Int32(2) OR t.x = Int32(3)) AND (t.x = Int32(4) OR t.x = Int32(5)) |
|               |   TableScan: t projection=[x]                                                                       |
| physical_plan | CoalesceBatchesExec: target_batch_size=8192                                                         |
|               |   FilterExec: (x@0 = 1 OR x@0 = 2 OR x@0 = 3) AND (x@0 = 4 OR x@0 = 5)                              | < ---- this filter still runs even though it will never be true
|               |     MemoryExec: partitions=1, partition_sizes=[1]                                                   |
|               |                                                                                                     |
+---------------+-----------------------------------------------------------------------------------------------------+
2 rows in set. Query took 0.003 seconds.

For other predicates that datafusion knows can not be true, the query is greatly simplified:

❯ explain select x from t where false;
+---------------+---------------+
| plan_type     | plan          |
+---------------+---------------+
| logical_plan  | EmptyRelation |
| physical_plan | EmptyExec     |
|               |               |
+---------------+---------------+

Describe the solution you'd like

It might be nice to extend the ExprSimplifier for logical expr simplification to handle these cases

Impact: I don't know how prevalent such "obviously false" expressions are in actual work loads, but my sense is if it is straightforward to prove the expression can never be true we should use that information while executing queries.

Describe alternatives you've considered

No response

Additional context

The code for LiteralGuarantees is related but works on PhysicalExpressions

Metadata

Metadata

Assignees

No one assigned

    Labels

    enhancementNew feature or request

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions