Labels: enhancement (New feature or request), performance (Make DataFusion faster)
Description
Is your feature request related to a problem or challenge?
statement count 0
create table t(a int) as values (1), (2);
query I
select count(distinct a) from t;
----
2
query TT
explain
select count(distinct a) from t;
----
logical_plan
01)Projection: count(alias1) AS count(DISTINCT t.a)
02)--Aggregate: groupBy=[[]], aggr=[[count(alias1)]]
03)----Aggregate: groupBy=[[t.a AS alias1]], aggr=[[]]
04)------TableScan: t projection=[a]
physical_plan
01)ProjectionExec: expr=[count(alias1)@0 as count(DISTINCT t.a)]
02)--AggregateExec: mode=Final, gby=[], aggr=[count(alias1)]
03)----CoalescePartitionsExec
04)------AggregateExec: mode=Partial, gby=[], aggr=[count(alias1)]
05)--------AggregateExec: mode=FinalPartitioned, gby=[alias1@0 as alias1], aggr=[]
06)----------CoalesceBatchesExec: target_batch_size=8192
07)------------RepartitionExec: partitioning=Hash([alias1@0], 4), input_partitions=4
08)--------------RepartitionExec: partitioning=RoundRobinBatch(4), input_partitions=1
09)----------------AggregateExec: mode=Partial, gby=[a@0 as alias1], aggr=[]
10)------------------DataSourceExec: partitions=1, partition_sizes=[1]
I think this query should execute with one of the specialized count distinct accumulators, such as PrimitiveDistinctCountAccumulator, BytesDistinctCountAccumulator, or FloatDistinctCountAccumulator. The current execution path, which rewrites count(distinct a) into a group-by aggregate followed by a plain count, looks quite complex and is probably not that well optimized.
I expect a specialized count distinct accumulator to be faster than the two aggregates combined.
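To make the comparison concrete, here is a minimal sketch in plain Rust of the two strategies. It is not DataFusion code: `DistinctCountSketch`, `count_distinct_two_step`, and `count_distinct_single_step` are hypothetical names, and the hash set only stands in for whatever the real specialized accumulators do internally. It illustrates that both paths give the same answer, and says nothing about which is faster.

```rust
use std::collections::HashSet;

/// Hypothetical stand-in for a specialized distinct-count accumulator.
/// Assumption: it keeps the non-NULL values it has seen in a hash set.
struct DistinctCountSketch {
    seen: HashSet<i32>,
}

impl DistinctCountSketch {
    fn new() -> Self {
        Self { seen: HashSet::new() }
    }

    /// Record every non-NULL value of a batch.
    fn update_batch(&mut self, values: &[Option<i32>]) {
        for v in values.iter().flatten() {
            self.seen.insert(*v);
        }
    }

    /// Combine the state of another partition into this one.
    fn merge(&mut self, other: DistinctCountSketch) {
        self.seen.extend(other.seen);
    }

    /// Number of distinct non-NULL values seen so far.
    fn evaluate(&self) -> i64 {
        self.seen.len() as i64
    }
}

/// Mirrors the shape of the plan above: first group by `a`
/// (one row per key, NULL forms its own group), then count the keys.
fn count_distinct_two_step(partitions: &[Vec<Option<i32>>]) -> i64 {
    let groups: HashSet<Option<i32>> =
        partitions.iter().flatten().copied().collect();
    // count(alias1) skips the NULL group.
    groups.iter().filter(|g| g.is_some()).count() as i64
}

/// Single aggregate: one accumulator per partition, merged at the end.
fn count_distinct_single_step(partitions: &[Vec<Option<i32>>]) -> i64 {
    let mut total = DistinctCountSketch::new();
    for partition in partitions {
        let mut partial = DistinctCountSketch::new();
        partial.update_batch(partition);
        total.merge(partial);
    }
    total.evaluate()
}

fn main() {
    let partitions = vec![vec![Some(1), Some(2)], vec![Some(2), None]];
    println!("two step:    {}", count_distinct_two_step(&partitions));
    println!("single step: {}", count_distinct_single_step(&partitions));
}
```

In the two-step form the deduplication happens in a separate grouping stage with its own repartitioning, while in the single-step form it lives inside the accumulator state. Whether the second form is actually faster in DataFusion is exactly what would need benchmarking.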
Describe the solution you'd like
Investigate why the distinct count accumulator is not used for this query, and whether switching to it improves performance.
ClickBench includes queries that use count(distinct), so we could benchmark against it to see whether the change is an improvement.
Describe alternatives you've considered
No response
Additional context
No response