select count(distinct ..) query doesn't go to the specialized distinct accumulator #15850

@jayzhan211

Is your feature request related to a problem or challenge?

statement count 0
create table t(a int) as values (1), (2);

query I
select count(distinct a) from t; 
----
2

query TT
explain
select count(distinct a) from t; 
----
logical_plan
01)Projection: count(alias1) AS count(DISTINCT t.a)
02)--Aggregate: groupBy=[[]], aggr=[[count(alias1)]]
03)----Aggregate: groupBy=[[t.a AS alias1]], aggr=[[]]
04)------TableScan: t projection=[a]
physical_plan
01)ProjectionExec: expr=[count(alias1)@0 as count(DISTINCT t.a)]
02)--AggregateExec: mode=Final, gby=[], aggr=[count(alias1)]
03)----CoalescePartitionsExec
04)------AggregateExec: mode=Partial, gby=[], aggr=[count(alias1)]
05)--------AggregateExec: mode=FinalPartitioned, gby=[alias1@0 as alias1], aggr=[]
06)----------CoalesceBatchesExec: target_batch_size=8192
07)------------RepartitionExec: partitioning=Hash([alias1@0], 4), input_partitions=4
08)--------------RepartitionExec: partitioning=RoundRobinBatch(4), input_partitions=1
09)----------------AggregateExec: mode=Partial, gby=[a@0 as alias1], aggr=[]
10)------------------DataSourceExec: partitions=1, partition_sizes=[1]

I think we should execute this query with a specialized count distinct accumulator such as PrimitiveDistinctCountAccumulator, BytesDistinctCountAccumulator, or FloatDistinctCountAccumulator. The current execution path, which rewrites the query into a group-by aggregate followed by a plain count, looks quite complex and is probably not well optimized.

I expect a specialized count distinct accumulator to be faster than the two aggregates combined.
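
To make the comparison concrete, here is a minimal standalone sketch of the two strategies. It uses only the Rust standard library and is not DataFusion code: `DistinctCountSketch`, `count_distinct_accumulator`, and `count_distinct_two_stage` are hypothetical names, and the method names only loosely mirror an accumulator-style interface. The first path keeps one hash set per partition and merges them; the second mimics the rewritten plan above (group by the column, then count the non-null groups).

```rust
use std::collections::HashSet;

/// Hypothetical stand-in for a specialized distinct-count accumulator.
/// It only illustrates the idea of "one hash set of seen values".
#[derive(Default)]
struct DistinctCountSketch {
    seen: HashSet<i32>,
}

impl DistinctCountSketch {
    /// Feed one batch of nullable values; NULLs are skipped,
    /// matching SQL `count(distinct ...)` semantics.
    fn update_batch(&mut self, values: &[Option<i32>]) {
        self.seen.extend(values.iter().flatten().copied());
    }

    /// Merge the partial state produced by another partition.
    fn merge(&mut self, other: DistinctCountSketch) {
        self.seen.extend(other.seen);
    }

    /// Final result: the number of distinct non-null values seen.
    fn evaluate(&self) -> i64 {
        self.seen.len() as i64
    }
}

/// Strategy A: one accumulator per partition, merged at the end.
fn count_distinct_accumulator(partitions: &[Vec<Option<i32>>]) -> i64 {
    let mut merged = DistinctCountSketch::default();
    for part in partitions {
        let mut partial = DistinctCountSketch::default();
        partial.update_batch(part);
        merged.merge(partial);
    }
    merged.evaluate()
}

/// Strategy B: mimic the rewritten plan. Stage 1 groups by the value
/// (NULL forms its own group); stage 2 counts the non-null group keys.
fn count_distinct_two_stage(partitions: &[Vec<Option<i32>>]) -> i64 {
    let groups: HashSet<Option<i32>> =
        partitions.iter().flatten().copied().collect();
    groups.iter().filter(|key| key.is_some()).count() as i64
}
```

Both functions return the same count for any input; the open question in this issue is which one is cheaper inside the engine, which the sketch does not attempt to answer.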

Describe the solution you'd like

Investigate why the distinct count accumulator is not used for this query, and whether switching to it improves performance.

ClickBench includes queries that use count(distinct ...), so we could benchmark against it to see whether the change is an improvement.

Describe alternatives you've considered

No response

Additional context

No response

Labels: enhancement (New feature or request), performance (Make DataFusion faster)
