[C++] min/max not deterministic if Parquet files have multiple row groups #20300

@asfimport

Description

The following code computes the minimum of a sequence of integers, first for 1 to 1e5 and then for 1 to 1e6. The result is deterministic for 1e5 values but non-deterministic for 1e6 values.

library(dplyr)  # attaches %>%, which both snippets rely on

sapply(1:100, function(x) {
  # create parquet file with a val column with numbers 1 to 100,000
  arrow::write_parquet(
    data.frame(val = 1:1e5), "test.parquet")
  # find minimum value
  arrow::open_dataset("test.parquet") %>%
    dplyr::summarise(min_val = min(val)) %>%
    dplyr::collect() %>% dplyr::pull(min_val)
}) %>% table()

sapply(1:100, function(x) {
  # create parquet file with a val column with numbers 1 to 1,000,000
  arrow::write_parquet(
    data.frame(val = 1:1e6), "test.parquet")
  # find minimum value
  arrow::open_dataset("test.parquet") %>%
    dplyr::summarise(min_val = min(val)) %>%
    dplyr::collect() %>% dplyr::pull(min_val)
}) %>% table()

The first 100 runs, using the numbers 1 to 1e5, find the minimum (1) all 100 times.

The second 100 runs, using the numbers 1 to 1e6, find the minimum (1) only 65 out of 100 times. The remaining runs return 131073, 262145, and 393217 (25, 8, and 2 times respectively). Each of these is one more than a multiple of 131072 (2^17).

.
  1 
100 
.
     1 131073 262145 393217 
    65     25      8      2 
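
The pattern in the wrong answers can be checked with a few lines of arithmetic. The sketch below is plain Python, independent of Arrow, and uses only the values reported in the table above. The name `chunk` for 131072 is an illustrative label for the spacing between the observed values; whether it corresponds to an Arrow batch size or a Parquet row-group boundary is an assumption, not something confirmed here.

```python
# Values returned as the "minimum" across the 100 runs on 1..1e6, with counts
# (copied from the table() output above).
observed = {1: 65, 131073: 25, 262145: 8, 393217: 2}

# Hypothetical label: 131072 = 2**17 is the spacing between the observed values.
# Treating it as a chunk size is an assumption made for illustration only.
chunk = 2 ** 17

def chunk_index(value, size=chunk):
    """Return (index, offset) such that value == index * size + offset + 1."""
    return divmod(value - 1, size)

decomposed = {v: chunk_index(v) for v in observed}
for value, (index, offset) in decomposed.items():
    print(f"{value:>7} = {index} * {chunk} + {offset + 1}  (seen {observed[value]} times)")
```

Every observed value has offset 0, so each one is the first (and smallest) value in a consecutive block of 131072 rows. That is consistent with the aggregation sometimes returning the minimum of a single block instead of the global minimum.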

 

Environment: $ lsb_release -a
No LSB modules are available.
Distributor ID: Ubuntu
Description: Ubuntu 20.04.4 LTS
Release: 20.04
Codename: focal
Reporter: Robert On
Assignee: Aldrin Montana / @drin

Related issues:

PRs and other links:

Note: This issue was originally created as ARROW-16904. Please see the migration documentation for further details.
