Description
When wrapping a type (or array) in a pyarrow object, we need to define which Python class to use. Currently, for extension types, this logic lives here in pyarrow_wrap_data_type:
arrow/python/pyarrow/public-api.pxi
Lines 114 to 120 in b413ac4
    elif type.get().id() == _Type_EXTENSION:
        ext_type = <const CExtensionType*> type.get()
        cpy_ext_type = dynamic_cast[_CPyExtensionTypePtr](ext_type)
        if cpy_ext_type != nullptr:
            return cpy_ext_type.GetInstance()
        else:
            out = BaseExtensionType.__new__(BaseExtensionType)
So there are currently two options:
- The ExtensionType is implemented in Python, by subclassing pyarrow.(Py)ExtensionType, which links to the C++ arrow::py::PyExtensionType (a subclass of arrow::ExtensionType). In this case, we store the Python type instance on the C++ instance, and return it as the Python object in pyarrow_wrap_data_type.
- The ExtensionType is implemented in C++, in which case we currently always fall back to wrapping it in the pyarrow.BaseExtensionType base class (there is currently a bug in this, but that is getting fixed in GH-33802).
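To make the two branches concrete, here is a minimal pure-Python model of the dispatch. It is only a sketch: every class and function name below is a stand-in invented for illustration (it does not import pyarrow), and `isinstance` plays the role of the `dynamic_cast` in the Cython snippet.

```python
class BaseExtensionType:
    """Stand-in for pyarrow.BaseExtensionType (the generic wrapper)."""
    def __init__(self, extension_name):
        self.extension_name = extension_name


class CppExtensionType:
    """Stand-in for a C++ arrow::ExtensionType instance."""
    def __init__(self, extension_name):
        self.extension_name = extension_name


class CppPyExtensionType(CppExtensionType):
    """Stand-in for arrow::py::PyExtensionType, which keeps a
    reference to the Python type instance that created it."""
    def __init__(self, extension_name, py_instance):
        super().__init__(extension_name)
        self._py_instance = py_instance

    def get_instance(self):
        return self._py_instance


def wrap_data_type(c_type):
    # Option 1: the type was defined in Python, so hand back that instance.
    if isinstance(c_type, CppPyExtensionType):
        return c_type.get_instance()
    # Option 2: the type was defined in C++, so use the generic wrapper.
    return BaseExtensionType(c_type.extension_name)


class MyPythonType(BaseExtensionType):
    """A Python-defined extension type with an extra method."""
    def describe(self):
        return f"python-defined {self.extension_name}"


py_type = MyPythonType("example.python_defined")
from_python = wrap_data_type(CppPyExtensionType("example.python_defined", py_type))
from_cpp = wrap_data_type(CppExtensionType("example.cpp_defined"))
```

In this model `from_python` is the very same object as `py_type`, so its extra method survives the round trip, while `from_cpp` is a plain `BaseExtensionType` with no type-specific API.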
However, this means that for extension types implemented in C++, there is currently no way to have a "richer" Python Type object (or Array object, since the array class is determined by the type, and a BaseExtensionType always uses the base ExtensionArray). Yet for an extension type, you might well want to add type-specific attributes or methods.
For canonical extension types that are implemented in Arrow C++ itself (for example, the Tensor extension type currently being discussed in #8510, or a previous effort to add a complex type as an extension type in #10565), I think it would work today to create a custom subclass of pyarrow.BaseExtensionType for the specific canonical type. We could then add a special case to pyarrow_wrap_data_type that checks the name of the extension type and, if it is a canonical one we implement ourselves, uses the Python subclass we provide.
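The name-based special case could look roughly like the sketch below. It is a pure-Python illustration, not pyarrow code: the lookup table, the `example.tensor` name, and the `ExampleTensorType` class are all hypothetical.

```python
class BaseExtensionType:
    """Stand-in for pyarrow.BaseExtensionType."""
    def __init__(self, extension_name):
        self.extension_name = extension_name


class ExampleTensorType(BaseExtensionType):
    """Hypothetical Python subclass shipped for one canonical type."""
    def is_tensor(self):
        return True


# Hypothetical table of canonical extension names that get a dedicated class.
_CANONICAL_WRAPPERS = {"example.tensor": ExampleTensorType}


def wrap_cpp_extension_type(extension_name):
    # Pick the dedicated subclass when the name is known, else the base class.
    cls = _CANONICAL_WRAPPERS.get(extension_name, BaseExtensionType)
    return cls(extension_name)


known = wrap_cpp_extension_type("example.tensor")
unknown = wrap_cpp_extension_type("thirdparty.custom")
```

Here `known` is an `ExampleTensorType`, whereas `unknown` has no entry in the table and still gets the generic wrapper.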
But for extension types that are implemented in C++ externally (or for extension types that are implemented in Arrow C++, but for which we don't provide a custom python subclass), that doesn't work.
I am wondering to what extent we want to allow "registering" a python class that should be used when wrapping a specific C++ extension type (and to what extent this would be useful for