-
Notifications
You must be signed in to change notification settings - Fork 1.9k
Description
Is your feature request related to a problem or challenge?
In InfluxDB we use Dictionary(Int32, Utf8) columns a lot.
Queries like this (with string constants) work great and are very fast
SELECT ... WHERE column = '1'Queries like this (note 1 is an integer, not a '1') the query goes very slow
SELECT ... WHERE column = 1@erratic-pattern and I tracked this down to an issue/ limitation in type coercion:
Reproducer
DataFusion CLI v37.1.0
> create table test as values (arrow_cast('1', 'Dictionary(Int32, Utf8)'));
0 row(s) fetched.
Elapsed 0.010 seconds.
> select arrow_typeof(column1) from test;
+----------------------------+
| arrow_typeof(test.column1) |
+----------------------------+
| Dictionary(Int32, Utf8) |
+----------------------------+
1 row(s) fetched.
Elapsed 0.002 seconds.
> explain SELECT * from test where column1 = 1;
+---------------+---------------------------------------------------+
| plan_type | plan |
+---------------+---------------------------------------------------+
| logical_plan | Filter: CAST(test.column1 AS Utf8) = Utf8("1") |
| | TableScan: test projection=[column1] |
| physical_plan | CoalesceBatchesExec: target_batch_size=8192 |
| | FilterExec: CAST(column1@0 AS Utf8) = 1 |
| | MemoryExec: partitions=1, partition_sizes=[1] |
| | |
+---------------+---------------------------------------------------+
2 row(s) fetched.
Elapsed 0.003 seconds.I think this shows the core problem:
| logical_plan | Filter: CAST(test.column1 AS Utf8) = Utf8("1") |
It basically shows the column is being converted to a string, rather than the constant being converted to th ecorrect type.
Not only does this mean the column is being un-encoded for the comparsion, it also means that PruningPredicate doesn't work either
Describe the solution you'd like
I would like the query to go fast lol
Specifically, I think the filter should look like this (no cast on the column, and instead the constant type matches)
| logical_plan | Filter: test.column1 = Dictionary(Int32, Utf8("1")) |
Note this is what happens if you compare the dictionary column to a string literal:
> explain SELECT * from test where column1 = '1';
+---------------+-----------------------------------------------------+
| plan_type | plan |
+---------------+-----------------------------------------------------+
| logical_plan | Filter: test.column1 = Dictionary(Int32, Utf8("1")) |
| | TableScan: test projection=[column1] |
| physical_plan | CoalesceBatchesExec: target_batch_size=8192 |
| | FilterExec: column1@0 = 1 |
| | MemoryExec: partitions=1, partition_sizes=[1] |
| | |
+---------------+-----------------------------------------------------+
2 row(s) fetched.
Elapsed 0.002 seconds.
>Describe alternatives you've considered
We could potentially update the coercion logic to coerce 1 to Dictionary(.. "1") or maybe update the unwrap_comparsion logic
Additional context
No response