Pyarrow 8.0.0 write_dataset writes data in different order with use_threads=True #31870

In the latest (8.0.0) release, the following code snippet seems to write out data in a different order for each of the partitions when `use_threads=True` vs. when `use_threads=False`.

Testing the same snippet with pyarrow 7.0.0 gives the same order regardless of whether `use_threads` is set to True when the data is written.

```python
import itertools

import numpy as np
import pandas as pd
import pyarrow as pa
import pyarrow.dataset as ds

n_rows, n_cols = 100_000, 20

def create_dataframe(color, year):
    arr = np.random.randn(n_rows, n_cols)
    df = pd.DataFrame(data=arr, columns=[f"column_{i}" for i in range(n_cols)])
    df["color"] = color
    df["year"] = year
    df["id"] = np.arange(len(df))
    return df


partitions = ["red", "green", "blue"]
years = [2011, 2012, 2013]
dataframes = [create_dataframe(p, y) for p, y in itertools.product(partitions, years)]
df = pd.concat(dataframes)

table = pa.Table.from_pandas(df=df)

ds.write_dataset(
    table,
    "./test",
    format="parquet",
    max_rows_per_group=1_000_000,
    min_rows_per_group=1_000_000,
    existing_data_behavior="overwrite_or_ignore",
    partitioning=ds.partitioning(pa.schema([
        ("color", pa.string()),
        ("year", pa.int64())
    ]), flavor="hive"),
    use_threads=True,
)

df_read = pd.read_parquet("./test/color=blue/year=2012")
df_read.head()[["id"]]

```
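
Comparing `head()` output by eye only shows the first few rows. As a rough sketch (not part of the original report), the order of a partition's `id` column could be checked programmatically. The helper name `first_out_of_order` and the two small sample lists are hypothetical stand-ins for the `id` values read back from disk.

```python
def first_out_of_order(ids):
    """Return the position of the first value that is smaller than its
    predecessor, or None if the values never decrease."""
    values = list(ids)  # accepts a list, NumPy array, or pandas Series
    for pos in range(1, len(values)):
        if values[pos] < values[pos - 1]:
            return pos
    return None


# Hypothetical stand-ins for the "id" column of one partition.
in_order = [0, 1, 2, 3, 4, 5]
shuffled_batches = [3, 4, 5, 0, 1, 2]  # e.g. two batches written in swapped order

print(first_out_of_order(in_order))          # None
print(first_out_of_order(shuffled_batches))  # 3
```

In the reproducer, each `color`/`year` partition comes from a single `create_dataframe` call where `id` is `np.arange(len(df))`, so `first_out_of_order(df_read["id"])` returning anything other than `None` would suggest the rows were not written in input order, assuming the read itself returns rows in the order they are stored.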

Tested on Ubuntu 20.04 with Python 3.8 and arrow versions 8.0.0 and 7.0.0.

**Reporter**: [Daniel Friar](https://issues.apache.org/jira/browse/ARROW-16506)
#### Related issues:
- [[C++][Dataset] Preserve order when writing dataset](https://github.com/apache/arrow/issues/26818) (duplicates)

<sub>**Note**: *This issue was originally created as [ARROW-16506](https://issues.apache.org/jira/browse/ARROW-16506). Please see the [migration documentation](https://github.com/apache/arrow/issues/14542) for further details.*</sub>
