Description
Describe the bug, including details regarding any error messages, version, and platform.
Pyarrow's join does not produce the same results as Pandas when the input tables are large. I am observing this in industry data I work with, and I have a reproducible example below that mimics this data.
In this example, each table has 72 million unique rows and nine join key columns of mixed types (five string, three date, one integer). The tables are identical except for the 'val' column in the second table.
Pyarrow's join creates null values for 'val' where there should be actual values from the second table. The same join performed in Pandas produces the expected result.
I can produce the same result as Pandas by splitting each table into pieces, joining each left piece to each right piece, coalescing 'val', and concatenating the outputs (e.g., pa.concat_tables([tbl1_a.join(tbl2_a).join(tbl2_b), tbl1_b.join(tbl2_a).join(tbl2_b)])); a sketch follows below.
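A minimal sketch of that workaround, assuming two pieces per table (the helper name, the piece count, and the '_r' suffix are illustrative; Table.slice, Table.join, pc.coalesce, and pa.concat_tables are the actual Pyarrow APIs used):
import pyarrow as pa
import pyarrow.compute as pc
def piecewise_left_join(left, right, keys, val='val', n_pieces=2):
    #Split the right table into row slices
    rstep = -(-right.num_rows // n_pieces)  #ceiling division
    right_pieces = [right.slice(i * rstep, rstep) for i in range(n_pieces)]
    lstep = -(-left.num_rows // n_pieces)
    out = []
    for i in range(n_pieces):
        piece = left.slice(i * lstep, lstep)
        #Chain a left join against every right piece
        for rp in right_pieces:
            piece = piece.join(rp, keys=keys, join_type='left outer',
                               right_suffix='_r')
            if f'{val}_r' in piece.column_names:
                #Later joins collide on 'val'; coalesce and drop the copy
                merged = pc.coalesce(piece[val], piece[f'{val}_r'])
                piece = piece.set_column(piece.column_names.index(val),
                                         val, merged)
                piece = piece.drop([f'{val}_r'])
        out.append(piece)
    return pa.concat_tables(out)
With the tables from the example below, this would be called as piecewise_left_join(df1_pa, df2_pa, join_keys).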
Apologies for the long-running example. The first section that generates the join key data takes about an hour on my machine (AWS r5.24xlarge EC2 instance) with the rest taking about 30 minutes. Around 100GB of memory is necessary to run the code.
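As a side note for anyone reproducing this, Arrow's own allocations can be polled at any point during the run (pa.total_allocated_bytes reports the default memory pool; the GiB formatting is just for readability):
import pyarrow as pa
print(f"Arrow allocated: {pa.total_allocated_bytes() / 1024**3:.1f} GiB")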
import pyarrow as pa #11.0.0
import pandas as pd #1.5.3
import numpy as np #1.23.5
#Generate join key data
n_rows = 72000000
join_keys = [f'col{i}' for i in range(1,10)]
col_str = [str(i) for i in range(n_rows)]
col_date = [pd.to_datetime('2000-01-01') for i in range(n_rows)]
col_int = [i for i in range(n_rows)]
#Create dataframes -- df1 and df2 are identical except for the 'val' column
df1 = pd.DataFrame({'col1': col_str,
'col2': col_str,
'col3': col_str,
'col4': col_str,
'col5': col_str,
'col6': col_date,
'col7': col_date,
'col8': col_date,
'col9': col_int})
df2 = pd.DataFrame({'col1': col_str,
'col2': col_str,
'col3': col_str,
'col4': col_str,
'col5': col_str,
'col6': col_date,
'col7': col_date,
'col8': col_date,
'col9': col_int,
'val': [i for i in range(n_rows - 10000000)] + [np.nan for i in range(10000000)]})
#Create Pyarrow Tables and merge
df1_pa = pa.Table.from_pandas(df1)
df2_pa = pa.Table.from_pandas(df2)
merge_pa = df1_pa.join(df2_pa, keys = join_keys, join_type = 'left outer')
merge_pa_df = merge_pa.to_pandas()
#Merge dataframes analogously in Pandas
merge_pd_df = pd.merge(df1, df2, on = join_keys, how = 'left')
#Compare results -- should have the same number of non-null values in 'val'
print(f"Pyarrow join non-null rows: {sum(merge_pa_df['val'].notnull())}")
print(f"Pandas merge non-null rows: {sum(merge_pd_df['val'].notnull())}")
#Returns
#"Pyarrow join non-null rows: 37317087" (also changes from run to run)
#"Pandas merge non-null rows: 62000000"
#Expected
#"Pyarrow join non-null rows: 62000000"
#"Pandas merge non-null rows: 62000000"
#Example row of unexpected output
#merge_pd_df.rename({'val':'val_pd'}, axis = 1, inplace = True)
#merge_pa_df.rename({'val':'val_pa'}, axis = 1, inplace = True)
#comp = pd.merge(merge_pd_df, merge_pa_df, on = join_keys, how = 'left')
#col1 col2 col3 col4 col5 col6 col7 col8 col9 val_pd val_pa
#0 0 0 0 0 2000-01-01 2000-01-01 2000-01-01 0 0.0 NaN
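For scale, the affected rows can be counted from the comp frame above once those lines are uncommented (the 'mismatch' name is illustrative):
mismatch = comp[comp['val_pd'].notnull() & comp['val_pa'].isnull()]
print(f"Rows where Pyarrow's join lost 'val': {len(mismatch)}")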
Component(s)
Python