Open
Description
Describe the bug, including details regarding any error messages, version, and platform.
The PyArrow documentation suggests that the exclude_invalid_files parameter defaults to True for the dataset() function, but in practice, it appears to default to False. This causes the function to fail when encountering invalid Parquet files instead of skipping them.
Here is a script to reproduce the issue, courtesy of an AI assistant:
"""
PyArrow Dataset Bug Report: exclude_invalid_files parameter default
Issue: PyArrow documentation states that exclude_invalid_files defaults to True,
but testing shows it behaves as if the default is False.
"""
import os
import shutil
import pyarrow as pa
import pyarrow.dataset as ds
import pyarrow.parquet as pq
# Setup test directory
test_dir = "/tmp/pyarrow_test"
os.makedirs(test_dir, exist_ok=True)
# Create a valid Parquet file
valid_data = pa.table({"a": [1, 2, 3]})
pq.write_table(valid_data, f"{test_dir}/valid.parquet")
# Create an invalid "Parquet" file
with open(f"{test_dir}/invalid.parquet", "w") as f:
    f.write("This is not a valid Parquet file")
print("Test setup complete. Testing dataset() with invalid files...")
# Test 1: Without specifying exclude_invalid_files (should use default)
try:
    print("\nTEST 1: Using default parameter (documentation says default=True)")
    dataset = ds.dataset(test_dir, format="parquet")
    print("✓ Success! Dataset created with invalid files ignored")
    print(f"Files found: {len(dataset.files)}")
except Exception as e:
    print(f"✗ Failed! Error: {e}")
    print("This indicates exclude_invalid_files actually defaults to False")
# Test 2: Explicitly set exclude_invalid_files=True
try:
    print("\nTEST 2: With exclude_invalid_files=True")
    dataset = ds.dataset(test_dir, format="parquet", exclude_invalid_files=True)
    print("✓ Success! Dataset created with invalid files ignored")
    print(f"Files found: {len(dataset.files)}")
except Exception as e:
    print(f"✗ Failed! Error: {e}")
# Test 3: Explicitly set exclude_invalid_files=False
try:
    print("\nTEST 3: With exclude_invalid_files=False")
    dataset = ds.dataset(test_dir, format="parquet", exclude_invalid_files=False)
    # We don't expect this to succeed, but if it does:
    print("✓ Success! Dataset created despite invalid files")
except Exception as e:
    print(f"✗ Failed as expected when handling invalid files: {type(e).__name__}")
# Cleanup
print("\nCleaning up test directory")
shutil.rmtree(test_dir)
print("""
Bug Report Conclusion:
----------------------
If Test 1 failed but Test 2 succeeded, this confirms that exclude_invalid_files
actually defaults to False, contrary to what the documentation suggests.
This is a documentation bug at minimum, and possibly a behavioral bug if the
intention was for the parameter to default to True.
""")
Component(s)
Python
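Possible workaround
Until the default is clarified, one option besides passing exclude_invalid_files=True explicitly is to pre-filter the directory yourself. The sketch below is not part of PyArrow: looks_like_parquet and candidate_parquet_files are hypothetical helper names, and the check is only a heuristic based on the 4-byte "PAR1" marker that unencrypted Parquet files carry at the start and end. It does not validate the footer or the data pages.

```python
import os

# Marker found at both ends of an unencrypted Parquet file.
# Files written with an encrypted footer end in b"PARE" and would be
# rejected by this sketch.
PARQUET_MAGIC = b"PAR1"

# Lower bound on file size: leading magic + 4-byte footer length + trailing magic.
MIN_PARQUET_SIZE = 12


def looks_like_parquet(path):
    """Heuristic check (hypothetical helper): compare only the magic bytes."""
    if os.path.getsize(path) < MIN_PARQUET_SIZE:
        return False
    with open(path, "rb") as f:
        head = f.read(4)
        f.seek(-4, os.SEEK_END)
        tail = f.read(4)
    return head == PARQUET_MAGIC and tail == PARQUET_MAGIC


def candidate_parquet_files(directory):
    """Return sorted paths of regular files in `directory` that pass the check."""
    paths = []
    for name in sorted(os.listdir(directory)):
        path = os.path.join(directory, name)
        if os.path.isfile(path) and looks_like_parquet(path):
            paths.append(path)
    return paths
```

With the test directory from the script above, candidate_parquet_files(test_dir) should return only valid.parquet, and the resulting list can be passed to ds.dataset(paths, format="parquet") instead of the directory. A file that passes this check can still be corrupt, so this narrows the problem rather than replacing exclude_invalid_files.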