ARROW-16212: [C++][Python] Register Multiple Kernels for a UDF #14320

Conversation
cc @pitrou @jorisvandenbossche @westonpace Could you please review this PR?

@vibhatha Can you please focus on Arrow 10.0.0 release blockers for now? Thank you.

Of course, thanks for the ping. This one was picked up before that issue was assigned (yesterday), and I have just finalized it. I am now working on the one I picked up yesterday.

cc @westonpace
westonpace left a comment:
A few nits.
My main concern is that, rather than have input_types be a dictionary, I think it would be best to have the argument names as a separate field. Otherwise it just forces the user to grapple with questions like "why is it a dictionary? What am I supposed to use for the key? Can different rows have different keys?" when that could be more intuitive.
import pyarrow as pa
import pyarrow.compute as pc


def test_multi_kernel_registration():
    def unary_function(ctx, x):
        return pc.cast(pc.call_function("multiply", [x, 2],
                                        memory_pool=ctx.memory_pool), x.type)
    func_name = "y=x*2"
    unary_doc = {"summary": "multiply by two function",
                 "description": "test multiply function"}
    input_types = [
        pa.int8(),
        pa.int16(),
        pa.int32(),
        pa.int64(),
        pa.float32(),
        pa.float64()
    ]
    input_names = ["array"]
    output_types = [
        pa.int8(),
        pa.int16(),
        pa.int32(),
        pa.int64(),
        pa.float32(),
        pa.float64()
    ]
    pc.register_scalar_function(unary_function,
                                func_name,
                                unary_doc,
                                input_types,
                                input_names,
                                output_types)
I see your point, but I have a counter-argument (just my opinion). If we keep a separate list, it also becomes unclear which name goes with which type. Having them together keeps it readable. We had a case for aggregates where having them separate made things confusing. I am not saying this is exactly the same case, but it is close enough. I would rather propose documenting what we do clearly, with a few examples in the docstring.
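For context, a minimal sketch of the dictionary form being defended here, matching the single-kernel register_scalar_function signature that pyarrow shipped at the time (the function body is illustrative):

import pyarrow as pa
import pyarrow.compute as pc

def unary_function(ctx, x):
    # Multiply the input by two, using the kernel's memory pool.
    return pc.call_function("multiply", [x, 2], memory_pool=ctx.memory_pool)

# Dictionary form: each argument name maps directly to its type, so the
# name/type pairing lives in a single structure.
pc.register_scalar_function(unary_function,
                            "y=x*2",
                            {"summary": "multiply by two function",
                             "description": "test multiply function"},
                            {"array": pa.int64()},
                            pa.int64())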
We spoke about this briefly, but I will add my comment here as well. I get your point about the aggregates. Having two vectors that have to match in length is not an ideal API. However, I still think it is better than having to constantly repeat ourselves. I'm not 100% against it, so if we want to keep the status quo, we can.

Yes, that's a fair argument. I will work on it.

@westonpace I updated the PR and added a test case to validate that the lengths of the argument name and input type lists match.
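A hypothetical sketch of what such a validation test might look like, assuming the list-based signature from the snippet above (the error type and exact signature are assumptions, not the PR's actual code):

import pytest
import pyarrow as pa
import pyarrow.compute as pc

def test_mismatched_kernel_lists():
    def identity(ctx, x):
        return x
    doc = {"summary": "identity", "description": "returns its input"}
    input_types = [pa.int64(), pa.float64()]  # two kernels requested...
    input_names = ["x"]
    output_types = [pa.int64()]               # ...but only one output type
    # Registration should fail because the per-kernel lists disagree in length.
    with pytest.raises(Exception):
        pc.register_scalar_function(identity, "identity-udf", doc,
                                    input_types, input_names, output_types)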
out_type : DataType
    Output type of the function.
in_arg_types : List[List[DataType]]
    A list of list of DataTypes which includes input types for
Is this okay? The "list of list of" usage...
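To make the "list of list" shape concrete, one reading of the docstring (added for illustration; not text from the PR) is one inner list per kernel, with each inner list holding the types of that kernel's arguments:

import pyarrow as pa

# Two unary kernels: one for int64 inputs, one for float64 inputs.
in_arg_types = [
    [pa.int64()],
    [pa.float64()],
]

# A binary UDF with the same two kernels would instead look like:
in_arg_types_binary = [
    [pa.int64(), pa.int64()],
    [pa.float64(), pa.float64()],
]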
westonpace left a comment:
This seems good to me now.
This is a backwards-incompatible change to register_scalar_function. It's still experimental, I'd say, so this is probably ok, but I know a few people are using it.

@rtpsw do you want to take a look through this change? Also, this will eventually become a merge conflict for your table UDFs PR.

I have two issues here:
Assuming the actual UDF implementation is in C, I think the "python UDF" is just mapping to the appropriate external library call, and the external library will handle the type checking itself. For example, I know we had an example at one point where we registered numpy's

If the actual UDF implementation is in Python, then I think performance doesn't matter in this context (e.g. prototyping) and they are probably mapping with something like
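A minimal sketch of that numpy-mapping pattern, added here for illustration (the helper below is hypothetical, not the example referenced above):

import numpy as np
import pyarrow as pa

def times_two(ctx, x):
    # Defer the arithmetic to numpy, which does its own per-dtype dispatch
    # and type checking, so the Python UDF does not need one registered
    # kernel per input type.
    return pa.array(np.asarray(x) * 2)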
I'm not sure how best to address this. If I start from scratch, I think I'd end up with something a bit more complex but hopefully user-friendly (note, this proposal addresses your point 2 as well):

The example would then become:

@pitrou would that be a more favorable API, or can you give more description of what you would like to change?
One other possibility would be to first create the function and then register kernels one by one:

udf = pc.register_scalar_function(
    name="y=x*2",
    arg_names=["x"],
    doc={"summary": "multiply by two function",
         "description": "test multiply function"}
)
for in_type in [pa.int64(), pa.float64()]:
    udf.add_kernel([in_type], in_type, impl)

It could also open the way to a decorator-like syntax:

udf = pc.register_scalar_function(
    name="y=x*2",
    arg_names=["x"],
    doc={"summary": "multiply by two function",
         "description": "test multiply function"}
)

@udf.kernel([pa.float64()], pa.float64())
@udf.kernel([pa.int64()], pa.int64())
def impl(ctx, x):
    # ...

@jorisvandenbossche Opinions on which kind of API would be desirable here?
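As a side note, whichever registration style wins, calling the resulting UDF would presumably stay the same; with pyarrow's existing call_function this looks like:

import pyarrow as pa
import pyarrow.compute as pc

# Once registered, the UDF is invoked by name like any built-in kernel;
# multi-kernel dispatch would select the implementation matching the
# argument types.
result = pc.call_function("y=x*2", [pa.array([1, 2, 3], type=pa.int64())])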

@pitrou What approach should we take here? I completely understand your concern and will make the necessary changes to improve this. Also, one question: since this API is still experimental, do you think it is best to improve it right now, or in a follow-up PR?

Closing because it has been untouched for a while; in case it's still relevant, feel free to reopen and move it forward 👍
This PR adds support for multi-kernel registration for scalar UDFs in Python.
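For illustration, a hedged sketch of what registering one function with several kernels could look like under the list-of-lists signature from the docstring excerpt above (parameter names are taken from that excerpt and may not match the final code):

import pyarrow as pa
import pyarrow.compute as pc

def times_two(ctx, x):
    # Multiply by two, then cast back so each kernel's output type matches
    # its input type.
    return pc.cast(pc.call_function("multiply", [x, 2],
                                    memory_pool=ctx.memory_pool), x.type)

doc = {"summary": "multiply by two function",
       "description": "test multiply function"}
in_arg_types = [[pa.int64()], [pa.float64()]]  # one inner list per kernel
out_types = [pa.int64(), pa.float64()]         # one output type per kernel
pc.register_scalar_function(times_two, "y=x*2", doc,
                            in_arg_types, ["x"], out_types)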