[Python] Can not refer to field in a list of structs  #32794

@asfimport

When a dataset contains a list of structs (a nested "list" column), we cannot use pyarrow.field(..) to reference a sub-field of the struct.

For example:

import pyarrow as pa
import pyarrow.dataset as ds
import pandas as pd

schema = pa.schema(
    [
        pa.field(
            "objects",
            pa.list_(
                pa.struct(
                    [
                        pa.field("name", pa.utf8()),
                        pa.field("attr1", pa.float32()),
                        pa.field("attr2", pa.int32()),
                    ]
                )
            ),
        )
    ]
)

table = pa.Table.from_pandas(
    pd.DataFrame([{"objects": [{"name": "a", "attr1": 5.0, "attr2": 20}]}])
)
print(table)

dataset = ds.dataset(table)
print(dataset)
# Raises ArrowInvalid: "objects.attr2" is looked up as a single field name.
dataset.scanner(columns=["objects.attr2"]).to_table()

This raises the following exception:


Traceback (most recent call last):
  File "foo.py", line 31, in <module>
    dataset.scanner(columns=["objects.attr2"]).to_table()
  File "pyarrow/_dataset.pyx", line 298, in pyarrow._dataset.Dataset.scanner
  File "pyarrow/_dataset.pyx", line 2356, in pyarrow._dataset.Scanner.from_dataset
  File "pyarrow/_dataset.pyx", line 2202, in pyarrow._dataset._populate_builder
  File "pyarrow/error.pxi", line 100, in pyarrow.lib.check_status
pyarrow.lib.ArrowInvalid: No match for FieldRef.Name(objects.attr2) in objects: list<item: struct<attr1: double, attr2: int64, name: string>>
__fragment_index: int32
__batch_index: int32
__last_in_fragment: bool
__filename: string

Reporter: Lei (Eddy) Xu

Note: This issue was originally created as ARROW-17540. Please see the migration documentation for further details.
