ARROW-16212: [C++][Python] Register Multiple Kernels for a UDF #14320

Conversation
cc @pitrou @jorisvandenbossche @westonpace Could you please review this PR?

@vibhatha Can you please focus on Arrow 10.0.0 release blockers for now? Thank you.

Of course, thanks for the ping. This one was picked up before that issue was assigned (yesterday), and I have just finalized it. I am now working on the one I picked up yesterday.

cc @westonpace
westonpace left a comment:
A few nits.
My main concern is that, rather than have input_types be a dictionary, I think it would be best to have the argument names as a separate field. Otherwise it just forces the user to grapple with questions like "why is it a dictionary? What am I supposed to use for the key? Can different rows have different keys?" when that could be more intuitive.
import pyarrow as pa
import pyarrow.compute as pc


def test_multi_kernel_registration():
    def unary_function(ctx, x):
        return pc.cast(pc.call_function("multiply", [x, 2],
                                        memory_pool=ctx.memory_pool), x.type)
    func_name = "y=x*2"
    unary_doc = {"summary": "multiply by two function",
                 "description": "test multiply function"}
    input_types = [
        pa.int8(),
        pa.int16(),
        pa.int32(),
        pa.int64(),
        pa.float32(),
        pa.float64()
    ]
    input_names = ["array"]
    output_types = [
        pa.int8(),
        pa.int16(),
        pa.int32(),
        pa.int64(),
        pa.float32(),
        pa.float64()
    ]
    pc.register_scalar_function(unary_function,
                                func_name,
                                unary_doc,
                                input_types,
                                input_names,
                                output_types)
I see your point, but I have a counter-argument (just my opinion). If we keep a separate list, it also becomes unclear which name goes with which type. Having them together keeps it readable. We had a case for aggregates where having them separate made things confusing. I am not saying this is exactly the same case, but it is close enough. I would rather propose documenting what we do clearly, with a few examples in the docstring.
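For context, a minimal sketch of the dictionary form being defended here, matching the single-kernel register_scalar_function signature that pyarrow shipped at the time (the function body is illustrative):

import pyarrow as pa
import pyarrow.compute as pc

def unary_function(ctx, x):
    # Multiply the input by two, using the kernel's memory pool.
    return pc.call_function("multiply", [x, 2], memory_pool=ctx.memory_pool)

# Dictionary form: each argument name maps directly to its type, so the
# name/type pairing lives in a single structure.
pc.register_scalar_function(unary_function,
                            "y=x*2",
                            {"summary": "multiply by two function",
                             "description": "test multiply function"},
                            {"array": pa.int64()},
                            pa.int64())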
We spoke about this briefly, but I will add my comment here as well. I get your point about the aggregates. Having two vectors that have to match in length is not an ideal API. However, I still think it is better than having to constantly repeat ourselves. I'm not 100% against it, so if we want to keep the status quo, we can.

Yes, that's a fair argument. I will work on it.

@westonpace I updated the PR and added a test case to validate that the lengths of the argument name and input type lists match.
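A hypothetical sketch of what such a validation test might look like, assuming the list-based signature from the snippet above (the error type and exact signature are assumptions, not the PR's actual code):

import pytest
import pyarrow as pa
import pyarrow.compute as pc

def test_mismatched_kernel_lists():
    def identity(ctx, x):
        return x
    doc = {"summary": "identity", "description": "returns its input"}
    input_types = [pa.int64(), pa.float64()]  # two kernels requested...
    input_names = ["x"]
    output_types = [pa.int64()]               # ...but only one output type
    # Registration should fail because the per-kernel lists disagree in length.
    with pytest.raises(Exception):
        pc.register_scalar_function(identity, "identity-udf", doc,
                                    input_types, input_names, output_types)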
out_type : DataType
    Output type of the function.
in_arg_types : List[List[DataType]]
    A list of list of DataTypes which includes input types for
Is this okay? The "list of list of" usage...
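To make the "list of list" shape concrete, one reading of the docstring (added for illustration; not text from the PR) is one inner list per kernel, with each inner list holding the types of that kernel's arguments:

import pyarrow as pa

# Two unary kernels: one for int64 inputs, one for float64 inputs.
in_arg_types = [
    [pa.int64()],
    [pa.float64()],
]

# A binary UDF with the same two kernels would instead look like:
in_arg_types_binary = [
    [pa.int64(), pa.int64()],
    [pa.float64(), pa.float64()],
]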
westonpace left a comment:
This seems good to me now.
This is a backwards-incompatible change to register_scalar_function. It's still experimental, I'd say, so this is probably ok, but I know a few people are using it.

@rtpsw do you want to take a look through this change? Also, this will eventually become a merge conflict for your table UDFs PR.

I have two issues here:
Assuming the actual UDF implementation is in C, I think the "python UDF" is just mapping to the appropriate external library call, and the external library will handle the type checking itself. For example, I know we had an example at one point where we registered numpy's

If the actual UDF implementation is in Python, then I think performance doesn't matter in this context (e.g. prototyping) and they are probably mapping with something like
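A minimal sketch of that numpy-mapping pattern, added here for illustration (the helper below is hypothetical, not the example referenced above):

import numpy as np
import pyarrow as pa

def times_two(ctx, x):
    # Defer the arithmetic to numpy, which does its own per-dtype dispatch
    # and type checking, so the Python UDF does not need one registered
    # kernel per input type.
    return pa.array(np.asarray(x) * 2)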
I'm not sure how best to address this. If I start from scratch, I think I'd end up with something a bit more complex but hopefully user-friendly (note, this proposal addresses your point 2 as well):

The example would then become:

@pitrou would that be a more favorable API, or can you give more description of what you would like to change?
One other possibility would be to first create the function and then register kernels one by one:

udf = pc.register_scalar_function(
    name="y=x*2",
    arg_names=["x"],
    doc={"summary": "multiply by two function",
         "description": "test multiply function"}
)
for in_type in [pa.int64(), pa.float64()]:
    udf.add_kernel([in_type], in_type, impl)

It could also open the way to a decorator-like syntax:

udf = pc.register_scalar_function(
    name="y=x*2",
    arg_names=["x"],
    doc={"summary": "multiply by two function",
         "description": "test multiply function"}
)

@udf.kernel([pa.float64()], pa.float64())
@udf.kernel([pa.int64()], pa.int64())
def impl(ctx, x):
    # ...

@jorisvandenbossche Opinions on which kind of API would be desirable here?
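As a side note, whichever registration style wins, calling the resulting UDF would presumably stay the same; with pyarrow's existing call_function this looks like:

import pyarrow as pa
import pyarrow.compute as pc

# Once registered, the UDF is invoked by name like any built-in kernel;
# multi-kernel dispatch would select the implementation matching the
# argument types.
result = pc.call_function("y=x*2", [pa.array([1, 2, 3], type=pa.int64())])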

@pitrou What approach should we take here? I completely understand your concern and will make the necessary changes to improve this. Also, one question: since this API is still experimental, do you think it is best to improve it right now, or in a follow-up PR?

Closing because it has been untouched for a while; in case it's still relevant, feel free to reopen and move it forward 👍
This PR adds support for multi-kernel registration for scalar UDFs in Python.
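For illustration, a hedged sketch of what registering one function with several kernels could look like under the list-of-lists signature from the docstring excerpt above (parameter names are taken from that excerpt and may not match the final code):

import pyarrow as pa
import pyarrow.compute as pc

def times_two(ctx, x):
    # Multiply by two, then cast back so each kernel's output type matches
    # its input type.
    return pc.cast(pc.call_function("multiply", [x, 2],
                                    memory_pool=ctx.memory_pool), x.type)

doc = {"summary": "multiply by two function",
       "description": "test multiply function"}
in_arg_types = [[pa.int64()], [pa.float64()]]  # one inner list per kernel
out_types = [pa.int64(), pa.float64()]         # one output type per kernel
pc.register_scalar_function(times_two, "y=x*2", doc,
                            in_arg_types, ["x"], out_types)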