Closed
Description
While testing the new feature from ARROW-8647, I hit a segfault. A Python test that reproduces it:
import pytest
import pyarrow as pa
import pyarrow.dataset as ds


@pytest.mark.parquet
@pytest.mark.parametrize('partitioning', ["directory", "hive"])
def test_open_dataset_partitioned_dictionary_type(tempdir, partitioning):
    # tempdir is the temporary-directory fixture from pyarrow's test suite
    import pyarrow.parquet as pq
    table = pa.table({'a': range(9), 'b': [0.] * 4 + [1.] * 5})

    path = tempdir / "dataset"
    path.mkdir()

    # Write the same table once per partition value
    for part in ["A", "B", "C"]:
        fmt = "{}" if partitioning == "directory" else "part={}"
        part = path / fmt.format(part)
        part.mkdir()
        pq.write_table(table, part / "test.parquet")

    if partitioning == "directory":
        part = ds.DirectoryPartitioning.discover(
            ["part"], max_partition_dictionary_size=-1)
    else:
        part = ds.HivePartitioning.discover(max_partition_dictionary_size=-1)

    dataset = ds.dataset(str(path), partitioning=part)

    # The partition field is expected to be dictionary-encoded
    expected_schema = table.schema.append(
        pa.field("part", pa.dictionary(pa.int32(), pa.string()))
    )
    assert dataset.schema.equals(expected_schema)

This test fails (segfaults) for HivePartitioning, but works for DirectoryPartitioning.
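For reference, the two cases differ only in how the test names the partition directories on disk. A minimal sketch of the resulting layout, using only the standard library and the same format strings as the test (the helper name `partition_paths` is hypothetical and not part of pyarrow):

```python
from pathlib import PurePosixPath


def partition_paths(scheme, values, field="part"):
    """Return the relative file paths the test writes for each partition value."""
    # Same format strings as in the test: a bare value for directory
    # partitioning, "key=value" for hive partitioning.
    fmt = "{}" if scheme == "directory" else field + "={}"
    return [str(PurePosixPath(fmt.format(v)) / "test.parquet") for v in values]


print(partition_paths("directory", ["A", "B", "C"]))
print(partition_paths("hive", ["A", "B", "C"]))
```

In the directory case the field name `part` is not in the path, so the test passes it to `discover` explicitly. In the hive case the name is part of the `part=A` directory segments themselves.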
Reporter: Joris Van den Bossche / @jorisvandenbossche
Assignee: Joris Van den Bossche / @jorisvandenbossche
Note: This issue was originally created as ARROW-9288. Please see the migration documentation for further details.