
[Python] Custom Python type/array subclasses for ExtensionTypes implemented in C++ #33997

@jorisvandenbossche

When wrapping a type (or array) in a pyarrow object, we need to define which Python class to use. Currently, for extension types, this logic lives here in pyarrow_wrap_data_type:

elif type.get().id() == _Type_EXTENSION:
    ext_type = <const CExtensionType*> type.get()
    cpy_ext_type = dynamic_cast[_CPyExtensionTypePtr](ext_type)
    if cpy_ext_type != nullptr:
        return cpy_ext_type.GetInstance()
    else:
        out = BaseExtensionType.__new__(BaseExtensionType)

So there are currently two options:

  • The ExtensionType is implemented in Python by subclassing pyarrow.(Py)ExtensionType, which links to the C++ arrow::py::PyExtensionType (a subclass of arrow::ExtensionType). In this case, we store the Python type instance on the C++ instance and return it as the Python object in pyarrow_wrap_data_type.
  • The ExtensionType is implemented in C++, in which case we currently always fall back to wrapping it in the pyarrow.BaseExtensionType base class (there is currently a bug in this, but that is getting fixed in GH-33802).

However, that means that for extension types implemented in C++, there is currently no way to have a "richer" Python Type object (or Array object, since the array class is determined by the Type, and a BaseExtensionType always uses the base ExtensionArray), even though for an extension type you might want to add type-specific attributes or methods.

For canonical extension types that are implemented in Arrow C++ itself (for example, the currently discussed Tensor extension type in #8510, or a previous effort to add a complex type as an extension type in #10565), I think it will work today to create a custom subclass of pyarrow.BaseExtensionType for the specific canonical type. We could then add a special case to pyarrow_wrap_data_type that checks the name of the extension type and, if it is a canonical one we implement ourselves, uses the Python subclass we provide for it.
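To make the name-based special case a bit more concrete, here is a rough pure-Python sketch of the lookup. It does not use pyarrow at all: BaseExtensionType below is only a stand-in for pyarrow.BaseExtensionType, and TensorType, the "example.tensor" name, the lookup table and wrap_extension_type are all hypothetical names made up for illustration. The real logic would live in Cython inside pyarrow_wrap_data_type.

```python
class BaseExtensionType:
    """Stand-in for pyarrow.BaseExtensionType (NOT the real class)."""

    def __init__(self, extension_name):
        self.extension_name = extension_name


class TensorType(BaseExtensionType):
    """Hypothetical richer subclass for one canonical extension type."""

    def describe(self):
        # Example of a type-specific method the base class does not offer.
        return f"tensor extension type {self.extension_name!r}"


# Hypothetical table: extension name -> Python class to wrap it with.
_CANONICAL_TYPE_CLASSES = {
    "example.tensor": TensorType,
}


def wrap_extension_type(extension_name):
    """Pick the Python class by extension name, else use the base class."""
    cls = _CANONICAL_TYPE_CLASSES.get(extension_name, BaseExtensionType)
    return cls(extension_name)


known = wrap_extension_type("example.tensor")
unknown = wrap_extension_type("third_party.ext")
print(type(known).__name__, type(unknown).__name__)
```

With this shape, a C++ extension type whose name has no entry in the table still ends up wrapped in the generic base class, which is the current behaviour.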

But for extension types that are implemented in C++ externally (or for extension types that are implemented in Arrow C++, but for which we don't provide a custom python subclass), that doesn't work.
I am wondering to what extent we want to allow "registering" a python class that should be used when wrapping a specific C++ extension type (and to what extent this would be useful for
