Different behavior of distinct count on null inputs #267

@viirya

Describe the bug

A few Spark tests with distinct count failed while working on #250: https://github.com/apache/arrow-datafusion-comet/actions/runs/8681807652/job/23805034669?pr=250

[info] TwoLevelAggregateHashMapSuite:
[info] - multiple column distinct count *** FAILED *** (530 milliseconds)
[info]   Results do not match for query:
...
[info]   == Results ==
[info]   !== Correct Answer - 1 ==   == Spark Answer - 1 ==
[info]   !struct<>                   struct<count(key1, key2, key3):bigint>
[info]   ![3]                        [4] (QueryTest.scala:243)

Spark's distinct count aggregation doesn't count null inputs: a row is skipped if any of its nullable input columns is null. I.e.,

override lazy val updateExpressions = {
  ..
  Seq(
    /* count = */ If(nullableChildren.map(IsNull).reduce(Or), count, count + 1L)
  )
}

But it seems DataFusion's count aggregation function behaves differently.
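
To make the rule in the update expression concrete, here is a minimal plain-Scala sketch with no Spark or DataFusion dependency. All names (`DistinctCountSketch`, `distinctCountSkippingNulls`, `distinctCountKeepingNulls`, `sample`) are hypothetical, and the sample rows are made up; they are not the data used by the failing test. The second function only shows how dropping the null check changes the result; that DataFusion behaves exactly this way is an assumption, not something confirmed here.

```scala
// Plain-Scala model of multi-column distinct count; illustration only.
object DistinctCountSketch {
  // A row is a sequence of nullable column values, modelled as Option.
  type Row = Seq[Option[Int]]

  // Spark-like rule: ignore any row that contains a null,
  // then count the distinct remaining rows.
  def distinctCountSkippingNulls(rows: Seq[Row]): Long =
    rows.filter(_.forall(_.isDefined)).distinct.size.toLong

  // Hypothetical contrast: count distinct rows without the null check.
  def distinctCountKeepingNulls(rows: Seq[Row]): Long =
    rows.distinct.size.toLong

  // Made-up sample data with one duplicate row and one row holding a null.
  val sample: Seq[Row] = Seq(
    Seq(Some(1), Some(1), Some(1)),
    Seq(Some(1), Some(1), Some(1)), // duplicate of the first row
    Seq(Some(2), Some(2), Some(2)),
    Seq(Some(3), Some(3), Some(3)),
    Seq(Some(4), None, Some(4))     // contains a null, skipped by the first rule
  )
}
```

On this sample, `distinctCountSkippingNulls` returns 3 and `distinctCountKeepingNulls` returns 4, so a single row with a null is enough to make the two rules disagree.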

Steps to reproduce

No response

Expected behavior

No response

Additional context

No response

Labels

bug (Something isn't working)
