Skip to content

Slow comparisions to dictionary columns with type coercion #10220

@alamb

Description

@alamb

Is your feature request related to a problem or challenge?

In InfluxDB we use Dictionary(Int32, Utf8) columns a lot.

Queries like this (with string constants) work great and are very fast

SELECT ... WHERE column = '1'

Queries like this (note 1 is an integer, not a '1') the query goes very slow

SELECT ... WHERE column = 1

@erratic-pattern and I tracked this down to an issue/ limitation in type coercion:

Reproducer

DataFusion CLI v37.1.0
> create table test as values (arrow_cast('1', 'Dictionary(Int32, Utf8)'));
0 row(s) fetched.
Elapsed 0.010 seconds.

> select arrow_typeof(column1) from test;
+----------------------------+
| arrow_typeof(test.column1) |
+----------------------------+
| Dictionary(Int32, Utf8)    |
+----------------------------+
1 row(s) fetched.
Elapsed 0.002 seconds.

> explain SELECT * from test where column1 = 1;
+---------------+---------------------------------------------------+
| plan_type     | plan                                              |
+---------------+---------------------------------------------------+
| logical_plan  | Filter: CAST(test.column1 AS Utf8) = Utf8("1")    |
|               |   TableScan: test projection=[column1]            |
| physical_plan | CoalesceBatchesExec: target_batch_size=8192       |
|               |   FilterExec: CAST(column1@0 AS Utf8) = 1         |
|               |     MemoryExec: partitions=1, partition_sizes=[1] |
|               |                                                   |
+---------------+---------------------------------------------------+
2 row(s) fetched.
Elapsed 0.003 seconds.

I think this shows the core problem:

| logical_plan  | Filter: CAST(test.column1 AS Utf8) = Utf8("1")    |

It basically shows the column is being converted to a string, rather than the constant being converted to th ecorrect type.

Not only does this mean the column is being un-encoded for the comparsion, it also means that PruningPredicate doesn't work either

Describe the solution you'd like

I would like the query to go fast lol

Specifically, I think the filter should look like this (no cast on the column, and instead the constant type matches)

| logical_plan  | Filter: test.column1 = Dictionary(Int32, Utf8("1")) |

Note this is what happens if you compare the dictionary column to a string literal:

> explain SELECT * from test where column1 = '1';
+---------------+-----------------------------------------------------+
| plan_type     | plan                                                |
+---------------+-----------------------------------------------------+
| logical_plan  | Filter: test.column1 = Dictionary(Int32, Utf8("1")) |
|               |   TableScan: test projection=[column1]              |
| physical_plan | CoalesceBatchesExec: target_batch_size=8192         |
|               |   FilterExec: column1@0 = 1                         |
|               |     MemoryExec: partitions=1, partition_sizes=[1]   |
|               |                                                     |
+---------------+-----------------------------------------------------+
2 row(s) fetched.
Elapsed 0.002 seconds.

>

Describe alternatives you've considered

We could potentially update the coercion logic to coerce 1 to Dictionary(.. "1") or maybe update the unwrap_comparsion logic

Additional context

No response

Metadata

Metadata

Labels

enhancementNew feature or request

Type

No type

Projects

No projects

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions