-
Notifications
You must be signed in to change notification settings - Fork 4k
Open
Description
The objective is to list down a set of tasks required to provide UDF support for Apache Arrow streaming execution engine. In the first iteration we will be focusing on providing support for Python-based UDFs which can support Python functions.
The UDF Integration is going to pan out with a series of sub-tasks associated with the development and PoCs. Note that this is going to be the first iteration of UDF integrations with a limited scope. This ticket will cover the following topics;
- POC for UDF integration: The objective is to evaluate the existing components in the source and evaluate the required modifications and new building blocks required to integrate UDFs.
- The language will be limited to C+{}/{}Python users can register Python function as a UDF and use it with an
applymethod on Arrow Tables or provide a computation API endpoint via arrow::compute API. Note that the C+ API already provides a way to register custom functions via the function registry API. At the moment this is not exposed to Python. - Planned features for this ticket are;
- Scalar UDFs : UDFs executed per value (per row)
- Vector UDFs : UDFs executed per batch (a full array or partial array)
- Aggregate UDFs : UDFs associated with an aggregation operation
- Integration limitations
- Doesn't support custom data types which doesn't support Numpy or Pandas
- Complex processing with parallelism within UDFs are not supported
- Parallel UDFs are not supported in the initial version of UDFs. Allthough we are documenting what is required and a rough sketch for the next phase.
Reporter: Vibhatha Lakmal Abeykoon / @vibhatha
Assignee: Vibhatha Lakmal Abeykoon / @vibhatha
Subtasks:
- PoC for UDFs
- [C++][Python] UDF Optimizations
- [Docs] UDF Documentation
- [C++][Python] UDF Scalar Function Implementation
- [Docs] UDFs on Cookbook
- [C++][Python] UDF Aggregate Function Implementation
- [C++][Python] UDF Vector Function Implementation
- [C++][Python] Include UDFOptions
- [C++][Python] Unregister compute functions
- [C++][Python] Register Multiple Kernels for a UDF
- [C++][Python] Generate scalar-argument kernel based on array-argument kernel for UDFs
- [Docs][Python] Scalar UDF Experimental Documentation
- [Python] Allow calling UDF kernels with field/scalar expressions
- [C++][Python] Implement HashAggregate UDF
PRs and other links:
Note: This issue was originally created as ARROW-15635. Please see the migration documentation for further details.