
[Python] Table.join() produces incorrect results for large inputs #34474

@ericlin4

Description

Describe the bug, including details regarding any error messages, version, and platform.

PyArrow's join does not produce the same results as Pandas when the input tables are large. I first observed this in industry data I work with; the reproducible example below mimics that data.

In this example, each table has 72 million unique rows and nine join key columns of various types (string, timestamp, and integer). The tables are identical except for the 'val' column in the second table.

PyArrow's join produces null values for 'val' where actual values from the second table should appear. The same join performed in Pandas produces the expected result.

I can produce the same result as Pandas by splitting each table into pieces, joining each left piece to each right piece, coalescing 'val', and concatenating the outputs (e.g., pa.concat_tables([tbl1_a.join(tbl2_a).join(tbl2_b), tbl1_b.join(tbl2_a).join(tbl2_b)])); a sketch of this workaround follows below.
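Below is a minimal sketch of that workaround, assuming the right table's 'val' column is renamed per piece and coalesced afterwards; the helper name chunked_left_join and the split count are illustrative, and the sketch has not been tested at the 72-million-row scale.

import pyarrow as pa
import pyarrow.compute as pc

def chunked_left_join(left, right, keys, val_col='val', n_splits=2):
    #Split a table into n_splits contiguous row slices
    def split(tbl):
        step = -(-tbl.num_rows // n_splits)  #ceiling division
        return [tbl.slice(i * step, step) for i in range(n_splits)]

    pieces = []
    for lp in split(left):
        merged = lp
        val_names = []
        for j, rp in enumerate(split(right)):
            #Rename 'val' so each right piece contributes a distinct column
            rp = rp.rename_columns(
                [c if c != val_col else f'{val_col}_{j}' for c in rp.column_names])
            merged = merged.join(rp, keys=keys, join_type='left outer')
            val_names.append(f'{val_col}_{j}')
        #Coalesce the per-piece columns back into a single 'val'
        val = pc.coalesce(*[merged[c] for c in val_names])
        merged = merged.drop(val_names).append_column(val_col, val)
        pieces.append(merged)
    return pa.concat_tables(pieces)

#Usage (with the tables built below): chunked_left_join(df1_pa, df2_pa, join_keys)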

Apologies for the long-running example. The first section, which generates the join key data, takes about an hour on my machine (an AWS r5.24xlarge EC2 instance), and the rest takes about 30 minutes. Around 100 GB of memory is needed to run the code.

import pyarrow as pa  #11.0.0
import pandas as pd   #1.5.3
import numpy as np    #1.23.5

#Generate join key data
n_rows = 72000000
join_keys = [f'col{i}' for i in range(1,10)]

col_str = [str(i) for i in range(n_rows)]
col_date = [pd.to_datetime('2000-01-01') for i in range(n_rows)]
col_int = [i for i in range(n_rows)]

#Create dataframes -- df1 and df2 are identical except for the 'val' column
df1 = pd.DataFrame({'col1': col_str,
                    'col2': col_str,
                    'col3': col_str,
                    'col4': col_str,
                    'col5': col_str,
                    'col6': col_date,
                    'col7': col_date,
                    'col8': col_date,
                    'col9': col_int})

df2 = pd.DataFrame({'col1': col_str,
                    'col2': col_str,
                    'col3': col_str,
                    'col4': col_str,
                    'col5': col_str,
                    'col6': col_date,
                    'col7': col_date,
                    'col8': col_date,
                    'col9': col_int,
                    'val': [i for i in range(n_rows - 10000000)] + [np.nan for i in range(10000000)]})

#Create Pyarrow Tables and merge
df1_pa = pa.Table.from_pandas(df1)
df2_pa = pa.Table.from_pandas(df2)

merge_pa = df1_pa.join(df2_pa, keys = join_keys, join_type = 'left outer')
merge_pa_df = merge_pa.to_pandas()

#Merge dataframes analogously in Pandas
merge_pd_df = pd.merge(df1, df2, on = join_keys, how = 'left')

#Compare results -- should have the same number of non-null values in 'val'
print(f"Pyarrow join non-null rows: {sum(merge_pa_df['val'].notnull())}")
print(f"Pandas merge non-null rows: {sum(merge_pd_df['val'].notnull())}")

#Returns
#"Pyarrow join non-null rows: 37317087" (also changes from run to run)
#"Pandas merge non-null rows: 62000000"

#Expected
#"Pyarrow join non-null rows: 62000000"
#"Pandas merge non-null rows: 62000000"



#Example row of unexpected output
#merge_pd_df.rename({'val':'val_pd'}, axis = 1, inplace = True)
#merge_pa_df.rename({'val':'val_pa'}, axis = 1, inplace = True)
#comp = pd.merge(merge_pd_df, merge_pa_df, on = join_keys, how = 'left')

#col1   col2   col3   col4   col5   col6       col7       col8       col9   val_pd   val_pa
#0      0      0      0      0      2000-01-01 2000-01-01 2000-01-01 0      0.0      NaN
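Given the counts above, the number of rows where Pandas has a value but the PyArrow join does not should be 62000000 - 37317087 = 24682913 (assuming the PyArrow non-null rows are a subset of the Pandas ones). With the comp frame from the commented snippet built, this check (an illustration, not from the original report) would confirm it:

#print((comp['val_pd'].notnull() & comp['val_pa'].isnull()).sum())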

Component(s)

Python
