
Pyarrow 8.0.0 write_dataset writes data in different order with use_threads=True #31870

@asfimport

Description


In the 8.0.0 release the following code snippet writes the data out in a different order within each partition when use_threads=True than when use_threads=False.

Running the same snippet with pyarrow 7.0.0 produces the same order regardless of whether use_threads is set to True or False.


import itertools

import numpy as np
import pandas as pd
import pyarrow as pa
import pyarrow.dataset as ds

n_rows, n_cols = 100_000, 20

def create_dataframe(color, year):
    arr = np.random.randn(n_rows, n_cols)
    df = pd.DataFrame(data=arr, columns=[f"column_{i}" for i in range(n_cols)])
    df["color"] = color
    df["year"] = year
    df["id"] = np.arange(len(df))
    return df


partitions = ["red", "green", "blue"]
years = [2011, 2012, 2013]
dataframes = [create_dataframe(p, y) for p, y in itertools.product(partitions, years)]
df = pd.concat(dataframes)

table = pa.Table.from_pandas(df=df)

ds.write_dataset(
    table,
    "./test",
    format="parquet",
    max_rows_per_group=1_000_000,
    min_rows_per_group=1_000_000,
    existing_data_behavior="overwrite_or_ignore",
    partitioning=ds.partitioning(pa.schema([
        ("color", pa.string()),
        ("year", pa.int64())
    ]), flavor="hive"),
    use_threads=True,
)

df_read = pd.read_parquet("./test/color=blue/year=2012")
df_read.head()[["id"]]
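
One way to confirm the difference is to write the same table twice, once with and once without threads, and check whether the id column of a single partition comes back in its original ascending order. This is a minimal sketch reusing the table and partitioning from the snippet above; the output directories ./test_serial and ./test_threads are arbitrary names chosen for this illustration.

# Verification sketch (assumed directory names): write the same table with and
# without threads, then compare row order within one partition.
partitioning = ds.partitioning(pa.schema([
    ("color", pa.string()),
    ("year", pa.int64())
]), flavor="hive")

for path, threads in [("./test_serial", False), ("./test_threads", True)]:
    ds.write_dataset(
        table,
        path,
        format="parquet",
        max_rows_per_group=1_000_000,
        min_rows_per_group=1_000_000,
        existing_data_behavior="overwrite_or_ignore",
        partitioning=partitioning,
        use_threads=threads,
    )

ids_serial = pd.read_parquet("./test_serial/color=blue/year=2012")["id"]
ids_threads = pd.read_parquet("./test_threads/color=blue/year=2012")["id"]

# Each partition was built from a single frame with id = 0..n_rows-1, so a
# preserved write order means the column is monotonically increasing.
# Per the report above, both values are True on 7.0.0, while the threaded
# write on 8.0.0 can yield False for the second.
print(ids_serial.is_monotonic_increasing, ids_threads.is_monotonic_increasing)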


Tested on Ubuntu 20.04 with Python 3.8 against pyarrow 7.0.0 and 8.0.0.

Reporter: Daniel Friar


Note: This issue was originally created as ARROW-16506. Please see the migration documentation for further details.
