Description
This issue is present in pyarrow v7.0.0 but not in v6.0.1.
Pyarrow raises an error when reading a parquet file with an empty "in" filter on a string or categorical column (columns "E" and "F" in the example below). The issue does not occur in v7.0.0 when the empty filter is applied to a float, timestamp, or integer column ("A" through "D").
The following Python code presents a minimal example which reproduces the issue:
import pandas as pd
import numpy as np
path = './example_df.parquet'
df = pd.DataFrame(
    {
        "A": 1.0,
        "B": pd.Timestamp("20130102"),
        "C": pd.Series(1, index=list(range(4)), dtype="float32"),
        "D": np.array([3] * 4, dtype="int32"),
        "E": pd.Categorical(["test", "train", "test", "train"]),
        "F": "foo",
    }
)
df.to_parquet(path)
# Works!
df_read = pd.read_parquet(
    path,
    filters=[
        [
            ("A", "in", set())
        ]
    ]
)
# Pyarrow v6.0.1 and v7.0.0
#
# Empty DataFrame
# Columns: [A, B, C, D, E, F]
# Index: []
print(df_read)
# Fails!
df_read = pd.read_parquet(
    path,
    filters=[
        [
            ("F", "in", set())
        ]
    ]
)
# Pyarrow v6.0.1
#
# Empty DataFrame
# Columns: [A, B, C, D, E, F]
# Index: []
# Pyarrow v7.0.0
#
# pyarrow.lib.ArrowInvalid: Array type didn't match type of values set: string vs null
print(df_read)
Environment:
pandas 1.3.5
pyarrow 7.0.0
python 3.10.4
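A possible caller-side workaround, sketched under assumptions rather than taken from pyarrow itself: filters in the list-of-lists form used above are an OR of AND-groups, so any group containing an "in" predicate with an empty value set can never match. The hypothetical helper below (`prune_unsatisfiable` is an illustrative name, not a pandas or pyarrow function) drops such groups before the filters are passed on, so the empty value set never reaches the reader.

```python
def prune_unsatisfiable(filters):
    """Drop AND-groups that contain an "in" predicate with no values.

    `filters` is either a flat list of (column, op, value) tuples
    (a single AND-group) or a list of such lists (OR of AND-groups).
    This is a hypothetical helper, not part of pandas or pyarrow.
    """
    # Normalize the flat form into a single AND-group.
    if filters and isinstance(filters[0], tuple):
        filters = [filters]

    kept = []
    for conjunction in filters:
        # An empty "in" set makes the whole AND-group always false.
        # len() is only evaluated when the operator is "in".
        always_false = any(
            op == "in" and len(values) == 0
            for _, op, values in conjunction
        )
        if not always_false:
            kept.append(conjunction)
    return kept


remaining = prune_unsatisfiable([[("F", "in", set())]])
print(remaining)
```

If the returned list is empty, no row can match, so the caller could skip the filtered read and build an empty result some other way. If it is non-empty, the remaining groups can be passed as `filters` unchanged.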
Reporter: Damian Barabonkov / @DamianBarabonkovQC
Note: This issue was originally created as ARROW-16045. Please see the migration documentation for further details.