ARROW-15639 [C++][Python] UDF Scalar Function Implementation #12590
Conversation
westonpace left a comment
This is a great start. I'm excited for this feature. Thanks for taking this on.
@westonpace thank you for the detailed review. I will work on the suggested changes. It is exciting to write this new feature. Appreciate your support.
(force-pushed from eb3d20f to d2e616a)
pitrou left a comment
Ok, I've started to look at this. Here are some comments but there'll probably be more when the biggest issues are tackled :-)
python/pyarrow/_compute.pyx (outdated)
    return wrap_input_type(c_input_type)
    ...
    cdef class Arity(_Weakrefable):
This doesn't seem resolved to me. I agree this custom class is completely overkill. Also, it doesn't match how this information is currently exposed:
>>> import pyarrow as pa, pyarrow.compute as pc
>>> pc.get_function("random")
arrow.compute.Function<name=random, kind=scalar, arity=0, num_kernels=1>
>>> pc.get_function("sort_indices")
arrow.compute.Function<name=sort_indices, kind=meta, arity=1, num_kernels=0>
>>> pc.get_function("add")
arrow.compute.Function<name=add, kind=scalar, arity=2, num_kernels=33>
>>> pc.get_function("if_else")
arrow.compute.Function<name=if_else, kind=scalar, arity=3, num_kernels=39>
>>> pc.get_function("choose")
arrow.compute.Function<name=choose, kind=scalar, arity=Ellipsis, num_kernels=32>
(Note: the varargs convention of returning Ellipsis is perhaps not enough if we want to faithfully mirror the C++ function metadata, but that can be a follow-up PR.)
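For a UDF registered through this work, arity would similarly be derived from the declared input types rather than from a dedicated Arity class. A minimal sketch, assuming the pc.register_scalar_function API this PR adds (names as in later pyarrow releases, so details may differ from the revision under review here):

```python
import pyarrow as pa
import pyarrow.compute as pc

def my_multiply(ctx, a, b):
    # Delegate to a built-in kernel, allocating from the provided memory pool.
    return pc.multiply(a, b, memory_pool=ctx.memory_pool)

doc = {"summary": "multiply two arrays", "description": "example binary UDF"}

# Two entries in in_types -> the registered function reports arity=2,
# matching the convention shown for the built-in functions above.
pc.register_scalar_function(my_multiply, "my_multiply", doc,
                            {"a": pa.int64(), "b": pa.int64()}, pa.int64())

print(pc.get_function("my_multiply"))
# e.g. arrow.compute.Function<name=my_multiply, kind=scalar, arity=2, num_kernels=1>
```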
@pitrou Thanks a lot for the very descriptive review. I will address these issues.
@pitrou regarding this: #12590 (comment) Ah, I see your point. But a few things about this: aren't we going to use the options feature to take in UDF-specific options? The listed attributes in this suggestion are core to any UDF. I haven't thought deeply about how to expose dynamic options to Python yet; that's why it was empty in the first place. What do you think about this? cc @westonpace
@vibhatha I'm not entirely clear on the concern. I don't mind if we have an inheritance tree of object structs. For example, this would be OK (this is totally made up and likely is not correct for aggregate UDFs at all): …
@westonpace This makes sense to me. How about cases where we have to go for dynamic content, where we would have to attach those values from Python to this struct? I haven't explored this deeply; that's why I mentioned that part is yet to be implemented. I am not 100% sure if we need such an option, but I have yet to verify its necessity.
This is for function registration, so I don't think there will be any dynamic content here. For example, we currently have "scalar" and we know we will need "aggregate". There may be other classes of UDF that we add, but each time we do so it will be intentional and accompanied by a code change. I don't think we're looking into dynamically adding new classes of UDFs. Function options & state are a slightly different story: we do want to support dynamic function option content. However, from the C++ perspective, both of these things will just be …; function registration remains unchanged. Later, when users call their UDF, we would do something like… Then the Python layer could check to see if the options object extends… and then we'd probably need some logic, when we are actually calling the PyFunc, to grab the PyObject out of the options, so that hopefully the UDF itself could look something like… In summary, I think a flat struct should be sufficient for any cases we need to tackle.
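To make the shape of that idea concrete, here is a purely hypothetical Python sketch (none of these names exist in pyarrow; they only illustrate the flat-struct-plus-opaque-payload direction discussed above): registration stays static, while per-call options would carry an arbitrary Python object that the UDF unwraps.

```python
import pyarrow.compute as pc

class PyFunctionOptions:
    """Hypothetical options holder: its only job would be to carry an
    opaque Python payload across the C++ FunctionOptions boundary."""
    def __init__(self, payload):
        self.payload = payload

def scaled_add(ctx, options, array):
    # Hypothetical UDF signature: the Python payload is handed back verbatim,
    # so the UDF can read arbitrary per-call configuration from it.
    scale = options.payload["scale"]
    return pc.add(pc.multiply(array, scale), 1, memory_pool=ctx.memory_pool)

# Registration itself would need no dynamic content (fixed name, input types,
# output type, and kind of UDF); only the call site would supply options:
# pc.call_function("scaled_add", [arr], options=PyFunctionOptions({"scale": 3}))
```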
westonpace left a comment
I think we might have a bit of a problem with the function doc. Currently, Function does not take ownership of FunctionDoc in any way. That means that some external force has to make sure the lifetime of FunctionDoc outlives the lifetime of the registered Function. We have gotten away with this so far because all of our function docs are constants with static storage and thus live forever.
That will not work for python.
A) We could simply new up a heap variable with no intention of ever deleting it but that is going to make valgrind grumpy and it will also be an issue if we ever support unregistering UDFs (because the function doc wouldn't be deleted at that point) or if we support registering UDFs to custom registries (that may not live for the life of the program).
B) We could create a python "function registry" that proxies to the C++ function registry except it saves off function docs in a map. The "unregister function" operation (if we ever add one) could then remove the function doc from the map. Though we will need to repeat all of this logic when we get around to adding R UDFs and if any user wanted to add their own custom C++ UDFs they may end up repeating the logic too.
C) We could modify Function so that it actually takes ownership of the passed-in doc (via value copy or unique pointer). We do not ever copy function objects, we rarely create them, and the function doc is not accessed inside of any hot loop, so I don't see this having any negative performance impact, but I don't know why they were implemented this way in the first place. At most I think we'd be looking at a few nanoseconds of startup cost to copy the function doc constants when we initialize the registry. We could also use the "optional ownership" pattern if we were concerned about this cost.
My preference would be C but I haven't worked with this code too much so CC @lidavidm / @pitrou for a second opinion.
I would probably prefer (C). I would guess that the reason why FunctionDoc is stored in Function as a pointer and not a smart pointer is just that it made it easier to declare the documentation as a static.
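The lifetime issue is easy to see from the Python side. A small illustration, assuming the dict-based doc argument and the pc.register_scalar_function API introduced by this work:

```python
import pyarrow as pa
import pyarrow.compute as pc

def register_udf():
    # The doc dict (and the FunctionDoc built from it) is a temporary:
    # nothing on the Python side keeps it alive once this function returns.
    doc = {"summary": "increment", "description": "adds one to an int64 array"}
    pc.register_scalar_function(
        lambda ctx, a: pc.add(a, 1, memory_pool=ctx.memory_pool),
        "increment", doc, {"array": pa.int64()}, pa.int64())

register_udf()
# The registered Function and its documentation must outlive the dict above,
# which is why the C++ Function should own a copy of the FunctionDoc
# (option C) instead of borrowing a pointer to static storage.
print(pc.call_function("increment", [pa.array([1, 2], type=pa.int64())]))
```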
Ok, there's a memory leak and I will also help revamp the tests a bit.
@westonpace Do you want to give this another review?
@pitrou Thank you for the improvement 👍
westonpace left a comment
Thanks a lot for the support @westonpace @lidavidm @pitrou @jorisvandenbossche @amol-
This patch broke several packaging builds: https://app.travis-ci.com/github/ursacomputing/crossbow/builds/250149018#L1829
I see the error.
@kszucs That build compiles with Python 3.6, which we don't support anymore.
It seems most of the builds are broken because of this. Should we fix those builds for now? What's the best approach?
Perhaps, but they're not supposed to work, so... |
I see… |
We can just disable the Python module in Arrow C++ only for packages that still use Python 3.6. |
Curious: what is the long-term/stable fix for this issue? I think we re-labeled this feature for 9.0.0.
We probably also want to set the minimal Python version to 3.7 in the CMake configuration, instead of letting a compile error appear later. |
@vibhatha Could you open a Jira issue for this? I can work on this.
@kou Thank you for the support. I created the JIRA: https://issues.apache.org/jira/browse/ARROW-16474
Thanks!
Benchmark runs are scheduled for baseline = 7809c6d and contender = 7a0f00c. 7a0f00c is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
['Python', 'R'] benchmarks have a high level of regressions.
PR for Scalar UDF integration
This is the first phase of UDF integration in Arrow. This version only includes ScalarFunctions. In future PRs, vector UDFs (using Arrow VectorFunction), UDTFs (user-defined table functions), and aggregation UDFs will be integrated. This PR includes the following:
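For illustration, a minimal end-to-end sketch of registering and calling a scalar UDF with the Python API added by this work (names follow later pyarrow releases, so exact details may differ from this PR's revision):

```python
import pyarrow as pa
import pyarrow.compute as pc

doc = {
    "summary": "Add a constant to each element",
    "description": "Example scalar UDF that adds 1 to an int64 array.",
}

def add_one(ctx, array):
    # ctx exposes the memory pool the kernel should allocate from.
    return pc.add(array, 1, memory_pool=ctx.memory_pool)

pc.register_scalar_function(
    add_one,                 # the Python callable implementing the kernel
    "add_one",               # name under which the function is registered
    doc,                     # converted to a C++ FunctionDoc
    {"array": pa.int64()},   # input name -> type (determines the arity)
    pa.int64(),              # output type
)

result = pc.call_function("add_one", [pa.array([20, 30], type=pa.int64())])
print(result)  # -> [21, 31]
```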