[C++] count_distinct aggregates incorrectly across row groups

When reading from Parquet files with multiple row groups, `count_distinct` (wrapped by `n_distinct` in R) returns inaccurate and inconsistent results:

```r
library(dplyr, warn.conflicts = FALSE)

path <- tempfile(fileext = '.parquet')
arrow::write_parquet(dplyr::starwars, path, chunk_size = 20L)

ds <- arrow::open_dataset(path)

ds %>% count(sex) %>% collect()
#> # A tibble: 5 × 2
#>   sex                n
#>   <chr>          <int>
#> 1 male              60
#> 2 none               6
#> 3 female            16
#> 4 hermaphroditic     1
#> 5 <NA>               4

ds %>% summarise(n = n_distinct(sex)) %>% collect()
#> # A tibble: 1 × 1
#>       n
#>   <int>
#> 1    19
ds %>% summarise(n = n_distinct(sex)) %>% collect()
#> # A tibble: 1 × 1
#>       n
#>   <int>
#> 1    17
ds %>% summarise(n = n_distinct(sex)) %>% collect()
#> # A tibble: 1 × 1
#>       n
#>   <int>
#> 1    17
ds %>% summarise(n = n_distinct(sex)) %>% collect()
#> # A tibble: 1 × 1
#>       n
#>   <int>
#> 1    16
ds %>% summarise(n = n_distinct(sex)) %>% collect()
#> # A tibble: 1 × 1
#>       n
#>   <int>
#> 1    16
ds %>% summarise(n = n_distinct(sex)) %>% collect()
#> # A tibble: 1 × 1
#>       n
#>   <int>
#> 1    17
ds %>% summarise(n = n_distinct(sex)) %>% collect()
#> # A tibble: 1 × 1
#>       n
#>   <int>
#> 1    17

# correct: collect into R first, then let dplyr compute n_distinct
ds %>% collect() %>% summarise(n = n_distinct(sex))
#> # A tibble: 1 × 1
#>       n
#>   <int>
#> 1     5
```

If the file is stored as a single row group, the results are correct. When the aggregation is grouped, the results are also correct.

I can reproduce this in Python as well using the same file and `pyarrow.compute.count_distinct`:

```python

import pyarrow as pa
import pyarrow.parquet as pq

pa.__version__
#> 8.0.0

starwars = pq.read_table('/var/folders/0j/zz6p_mjx2_b727p6xdpm5chc0000gn/T//Rtmp2wnWl5/file1744f6cc6cea8.parquet')

pa.compute.count_distinct(starwars.column('sex')).as_py()
#> 15
pa.compute.unique(starwars.column('sex'))
#> [
#>   "male",
#>   "none",
#>   "female",
#>   "hermaphroditic",
#>    null
#> ]
```

This is likely the same problem as the one in this StackOverflow question, which reads from ORC files: https://stackoverflow.com/questions/72561901/how-do-i-compute-the-number-of-unique-values-in-a-pyarrow-array
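
The inflated counts (15 to 19 for a column with 5 distinct values) are in the range you would expect if per-row-group distinct counts were added together instead of the distinct sets being merged. That is only a guess at the failure mode, not a confirmed diagnosis. The pure-Python sketch below illustrates the difference; the chunk contents and the function names `distinct_by_summing` and `distinct_by_merging` are hypothetical and are not taken from the file above or from Arrow's code.

```python
from typing import Iterable, List, Optional

Chunk = List[Optional[str]]


def distinct_by_summing(chunks: Iterable[Chunk]) -> int:
    """Hypothetical faulty merge: add up each chunk's own distinct count."""
    return sum(len(set(chunk)) for chunk in chunks)


def distinct_by_merging(chunks: Iterable[Chunk]) -> int:
    """Expected merge: union the distinct values, then count once."""
    seen = set()
    for chunk in chunks:
        seen.update(chunk)
    return len(seen)


# Made-up stand-ins for five row groups of a `sex`-like column.
# None plays the role of NA and is counted as a value, as dplyr's n_distinct does.
chunks: List[Chunk] = [
    ["male", "female", None, "none"],
    ["male", "female"],
    ["male", "none", "hermaphroditic"],
    ["male", "female", None],
    ["male", "female"],
]

summed = distinct_by_summing(chunks)
merged = distinct_by_merging(chunks)
print(f"summed per chunk: {summed}, merged across chunks: {merged}")
```

With these made-up chunks, the summed count is well above the merged one because values such as `"male"` are counted again in every chunk where they appear. With a single chunk the two approaches agree, which would match the observation that single-row-group files give correct results.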

**Environment**: > arrow::arrow_info()
Arrow package version: 8.0.0.9000

Capabilities:
               
dataset    TRUE
substrait FALSE
parquet    TRUE
json       TRUE
s3         TRUE
utf8proc   TRUE
re2        TRUE
snappy     TRUE
gzip       TRUE
brotli     TRUE
zstd       TRUE
lz4        TRUE
lz4_frame  TRUE
lzo       FALSE
bz2        TRUE
jemalloc   TRUE
mimalloc  FALSE

Memory:
                   
Allocator  jemalloc
Current    37.25 Kb
Max       925.42 Kb

Runtime:
                        
SIMD Level          none
Detected SIMD Level none

Build:
                                                             
C++ Library Version                            9.0.0-SNAPSHOT
C++ Compiler                                       AppleClang
C++ Compiler Version                          13.1.6.13160021
Git ID               d9d78946607f36e25e9d812a5cc956bd00ab2bc9
**Reporter**: [Edward Visel](https://issues.apache.org/jira/browse/ARROW-16807) / @alistaire47
**Assignee**: [Aldrin Montana](https://issues.apache.org/jira/browse/ARROW-16807) / @drin
#### Related issues:
- [[C++] min/max not deterministic if Parquet files have multiple row groups](https://github.com/apache/arrow/issues/20300) (is related to)
#### PRs and other links:
- [GitHub Pull Request #13583](https://github.com/apache/arrow/pull/13583)

<sub>**Note**: *This issue was originally created as [ARROW-16807](https://issues.apache.org/jira/browse/ARROW-16807). Please see the [migration documentation](https://github.com/apache/arrow/issues/14542) for further details.*</sub>
