-
Notifications
You must be signed in to change notification settings - Fork 1.9k
Description
Is your feature request related to a problem or challenge?
@my-vegetable-has-exploded and I had an observation on #8654 (comment) where there are certain patterns that can never be true.
For example, WHERE x IN (1,2,3) AND x IN (4,5) can never be true because the IN lists are disjoint. However, DataFusion can't prove this today and will evaluate that expression.
So for example, given
DataFusion CLI v34.0.0
❯ create table t(x int) as values (1), (2), (3);
0 rows in set. Query took 0.003 seconds.This query:
❯ explain select x from t where x IN (1,2,3) AND x IN (4,5);
+---------------+-----------------------------------------------------------------------------------------------------+
| plan_type | plan |
+---------------+-----------------------------------------------------------------------------------------------------+
| logical_plan | Filter: (t.x = Int32(1) OR t.x = Int32(2) OR t.x = Int32(3)) AND (t.x = Int32(4) OR t.x = Int32(5)) |
| | TableScan: t projection=[x] |
| physical_plan | CoalesceBatchesExec: target_batch_size=8192 |
| | FilterExec: (x@0 = 1 OR x@0 = 2 OR x@0 = 3) AND (x@0 = 4 OR x@0 = 5) | < ---- this filter still runs even though it will never be true
| | MemoryExec: partitions=1, partition_sizes=[1] |
| | |
+---------------+-----------------------------------------------------------------------------------------------------+
2 rows in set. Query took 0.003 seconds.
For other predicates that datafusion knows can not be true, the query is greatly simplified:
❯ explain select x from t where false;
+---------------+---------------+
| plan_type | plan |
+---------------+---------------+
| logical_plan | EmptyRelation |
| physical_plan | EmptyExec |
| | |
+---------------+---------------+
Describe the solution you'd like
It might be nice to extend the ExprSimplifier for logical expr simplification to handle these cases
Impact: I don't know how prevalent such "obviously false" expressions are in actual work loads, but my sense is if it is straightforward to prove the expression can never be true we should use that information while executing queries.
Describe alternatives you've considered
No response
Additional context
The code for LiteralGuarantees is related but works on PhysicalExpressions