Support Count(Distinct) (and similar) aggregation #38

@sunchao

What is the problem the feature request solves?

We should also support distinct aggregations such as COUNT(DISTINCT col). In Spark, the query

SELECT COUNT(DISTINCT(_1)) FROM tbl

produces a plan like the following:

AdaptiveSparkPlan isFinalPlan=false
+- HashAggregate(keys=[], functions=[count(distinct _1#9)], output=[count(DISTINCT _1)#16L])
   +- Exchange SinglePartition, ENSURE_REQUIREMENTS, [plan_id=40]
      +- HashAggregate(keys=[], functions=[partial_count(distinct _1#9)], output=[count#20L])
         +- HashAggregate(keys=[_1#9], functions=[], output=[_1#9])
            +- Exchange hashpartitioning(_1#9, 5), ENSURE_REQUIREMENTS, [plan_id=37]
               +- HashAggregate(keys=[_1#9], functions=[], output=[_1#9])
                  +- Scan parquet [_1#9] Batched: true, DataFilters: [], Format: Parquet, Location: InMemoryFileIndex(1 paths)[..., PartitionFilters: [], PushedFilters: [], ReadSchema: struct<_1:int>
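
To make the stages of this plan concrete, here is a minimal pure-Python sketch of the same multi-stage aggregation. It is only an illustration, not Spark or native-engine code: the function name `count_distinct`, the list-of-lists representation of partitions, and the use of Python's built-in `hash` for partitioning are all assumptions made for this example.

```python
from collections import defaultdict


def count_distinct(partitions, num_shuffle_partitions=5):
    """Simulate the plan above on in-memory partitions (hypothetical helper)."""
    # HashAggregate(keys=[_1], functions=[]) before the shuffle:
    # de-duplicate values locally within each input partition.
    local_distinct = [set(part) for part in partitions]

    # Exchange hashpartitioning(_1, 5): route equal values to the same
    # shuffle partition so duplicates across inputs end up together.
    shuffled = defaultdict(list)
    for values in local_distinct:
        for v in values:
            shuffled[hash(v) % num_shuffle_partitions].append(v)

    # HashAggregate(keys=[_1], functions=[]) after the shuffle:
    # de-duplicate again, which makes every value globally unique.
    global_distinct = [set(vs) for vs in shuffled.values()]

    # HashAggregate(functions=[partial_count(distinct _1)]):
    # count the non-NULL values in each partition (COUNT ignores NULLs).
    partial_counts = [
        sum(1 for v in vs if v is not None) for vs in global_distinct
    ]

    # Exchange SinglePartition + final HashAggregate: sum the partial counts.
    return sum(partial_counts)


if __name__ == "__main__":
    print(count_distinct([[1, 2, 2], [2, 3, None], [3, 3, 1]]))  # 3
```

The two key-only aggregates do the de-duplication, so the last two stages reduce to an ordinary partial count followed by a sum. This suggests that a native plan mainly needs to handle group-by aggregates with no aggregate functions, plus the partial and final count over already-distinct input.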

Describe the potential solution

Add support for COUNT(DISTINCT) and similar distinct aggregations, so that the Spark physical plan can be converted to a native plan and executed.

Additional context

No response

Labels: enhancement (New feature or request)