ARROW-15639 [C++][Python] UDF Scalar Function Implementation #12590

vibhatha · 2022-03-09T13:25:00Z

PR for Scalar UDF integration

This is the first phase of UDF integration into Arrow. This version only includes ScalarFunctions.
In future PRs, Vector UDFs (using Arrow VectorFunction), UDTFs (user-defined table functions),
and Aggregation UDFs will be integrated. This PR includes the following:

UDF Python Scalar Function registration and usage
UDF Python Scalar Function Examples
UDF Python Scalar Function test cases
UDF C++ Example extended from Compute Function Example
Aggregation example (optional for this PR; if required, it can be removed and submitted in a separate PR)

github-actions · 2022-03-09T13:30:17Z

https://issues.apache.org/jira/browse/ARROW-15639

github-actions · 2022-03-09T13:30:19Z

⚠️ Ticket has no components in JIRA, make sure you assign one.

github-actions · 2022-03-09T13:30:20Z

⚠️ Ticket has not been started in JIRA, please click 'Start Progress'.

westonpace

This is a great start. I'm excited for this feature. Thanks for taking this on.

cpp/examples/arrow/aggregate_example.cc

python/pyarrow/_compute.pyx

python/pyarrow/public-api.pxi

python/pyarrow/tests/test_udf.py

vibhatha · 2022-03-22T03:44:03Z

@westonpace thank you for the detailed review. I will work on the suggested changes. It is exciting to write this new feature. Appreciate your support.

pitrou

Ok, I've started to look at this. Here are some comments but there'll probably be more when the biggest issues are tackled :-)

cpp/examples/arrow/CMakeLists.txt

cpp/examples/arrow/aggregate_example.cc

cpp/examples/arrow/udf_example.cc

pitrou · 2022-03-29T14:09:58Z

python/pyarrow/_compute.pyx

+        return wrap_input_type(c_input_type)
+
+
+cdef class Arity(_Weakrefable):


This doesn't seem resolved to me. I agree this custom class is completely overkill. Also, it doesn't match how this information is currently exposed:

>>> import pyarrow as pa, pyarrow.compute as pc
>>> pc.get_function("random")
arrow.compute.Function<name=random, kind=scalar, arity=0, num_kernels=1>
>>> pc.get_function("sort_indices")
arrow.compute.Function<name=sort_indices, kind=meta, arity=1, num_kernels=0>
>>> pc.get_function("add")
arrow.compute.Function<name=add, kind=scalar, arity=2, num_kernels=33>
>>> pc.get_function("if_else")
arrow.compute.Function<name=if_else, kind=scalar, arity=3, num_kernels=39>
>>> pc.get_function("choose")
arrow.compute.Function<name=choose, kind=scalar, arity=Ellipsis, num_kernels=32>

(note the varargs convention of returning Ellipsis is perhaps not enough if we want to faithfully mirror the C++ function metadata, but that can be a followup PR)
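
To make the convention in the session above concrete, here is a minimal plain-Python sketch of mapping arity metadata to an integer or `Ellipsis`. `ArityInfo` and `python_arity` are hypothetical names used only for illustration; they are not pyarrow classes, and the two-field layout is an assumption about what the C++ side carries.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ArityInfo:
    """Hypothetical stand-in for the C++ arity metadata (not a pyarrow class)."""
    num_args: int
    is_varargs: bool = False


def python_arity(info):
    """Map arity metadata to the value shown in the repr above.

    Fixed-arity functions report an int; varargs functions report Ellipsis,
    which drops whatever count is stored in num_args.
    """
    return Ellipsis if info.is_varargs else info.num_args


binary = ArityInfo(num_args=2)
varargs = ArityInfo(num_args=2, is_varargs=True)

print(python_arity(binary))   # 2
print(python_arity(varargs))  # Ellipsis
```

The second call shows the information loss mentioned in the note: once `Ellipsis` is returned, the stored argument count is no longer visible to the caller.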

python/pyarrow/public-api.pxi

python/pyarrow/_compute.pxd

python/pyarrow/_compute.pyx

python/examples/udf/udf_example.py

vibhatha · 2022-03-29T15:50:00Z

@pitrou Thanks a lot for the very descriptive review. I will address these issues.

vibhatha · 2022-03-31T06:49:56Z

@pitrou regarding this: #12590 (comment)

Ah I see your point. But a few things about this.

Aren't we going to use the options feature to take in UDF-specific options? The attributes listed in this suggestion are core to any UDF. I haven't thought deeply about how to expose dynamic options to Python yet. That's why it was empty in the first place. What do you think about this?

cc @westonpace

westonpace · 2022-03-31T07:27:39Z

@vibhatha I'm not entirely clear on the concern. I don't mind if we have an inheritance tree of option structs. For example, this would be ok (this is totally made up and likely is not correct for aggregate UDFs at all):

struct UdfOptions {
  cp::Function::Kind kind;
  cp::Arity arity;
  const cp::FunctionDoc func_doc;
  std::vector<cp::InputType> in_types;
  cp::OutputType out_type;
  cp::MemAllocation::type mem_allocation;
  cp::NullHandling::type null_handling;
};

struct ScalarUdfOptions : public UdfOptions {
  std::string func_name;
};

struct AggregateUdfOptions : public UdfOptions {
  std::string init_func_name;
  std::string map_func_name;
  std::string reduce_func_name;
};

Status MakeScalarFunction(PyObject* function, const ScalarUdfOptions& options);
Status MakeAggregateFunction(PyObject* function, const AggregateUdfOptions& options);

vibhatha · 2022-03-31T12:10:44Z

@westonpace This makes sense to me. What if we need dynamic content, where values from Python have to be attached to this struct? I haven't explored this deeply, which is why I mentioned that part is yet to be implemented. I am not 100% sure we need such an option; I have yet to verify its necessity.

westonpace · 2022-03-31T18:47:41Z

This is for function registration, so I don't think there will be any dynamic content here. For example, we currently have "scalar" and we know we will need "aggregate". There may be other classes of UDF that we add, but each time we do so it will be intentional and accompanied by a code change. I don't think we're looking into dynamically adding new classes of UDFs.

Function options & state are a slightly different story. We do want to support dynamic function option content. However, from the C++ perspective, both of these things will just be PyObject*. For example, maybe a user defines a custom datetime formatting UDF and they want to take the format pattern and locale in as objects. I think we could do it this way...

class UdfOptions(object):
  def __init__(self):
    pass

class CustomDateFormatOptions(UdfOptions):
  def __init__(self, format, locale):
    self.format = format
    self.locale = locale

Function registration remains unchanged. Later, when they call their UDF we would do something like...

pc.call_function("custom_date_format", [timestamps_arr], CustomDateFormatOptions(my_format, my_locale))

Then the python layer could check to see if the options object extends UdfOptions and, if it does, pass it to some kind of CUdfOptions:

struct UdfOptions {
  // Opaque handle to the Python options object
  PyObject* options_obj;
};

...then we'd probably need some logic when we are actually calling the PyFunc to grab the PyObject out of the options so that hopefully the UDF itself could look something like...

def custom_date_format(arr, options):
  ...

In summary, I think a flat struct should be sufficient for any cases we need to tackle.
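
The call path described above can be sketched end to end in plain Python. This is only an illustration under stated assumptions: `register_udf`, `call_udf`, and the `_registry` dict are hypothetical stand-ins for the real registry and for `pc.call_function`, and the UDF works on a plain list rather than an Arrow array.

```python
class UdfOptions:
    """Base class marking an object as UDF options (as in the sketch above)."""


class CustomDateFormatOptions(UdfOptions):
    def __init__(self, format, locale):
        self.format = format
        self.locale = locale


# Hypothetical registry standing in for the C++ function registry.
_registry = {}


def register_udf(name, func):
    _registry[name] = func


def call_udf(name, args, options=None):
    """Look up a UDF and forward the options object untouched."""
    if options is not None and not isinstance(options, UdfOptions):
        raise TypeError("options must extend UdfOptions")
    return _registry[name](*args, options)


def custom_date_format(arr, options):
    # Toy formatting: real code would operate on Arrow arrays.
    return [f"{value}|{options.format}|{options.locale}" for value in arr]


register_udf("custom_date_format", custom_date_format)
result = call_udf(
    "custom_date_format",
    [["2022-03-31"]],
    CustomDateFormatOptions("%Y", "en_US"),
)
print(result)  # ['2022-03-31|%Y|en_US']
```

The dispatch layer never inspects the fields of the options object; it only checks the base class and hands the object through, which is why a single flat holder on the C++ side would be enough.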

westonpace

I think we might have a bit of a problem with the function doc. Currently, Function does not take ownership of FunctionDoc in any way. That means that some external force has to make sure the lifetime of FunctionDoc outlives the lifetime of the registered Function. We have gotten away with this so far because all of our function docs are constants with static storage and thus live forever.

That will not work for Python.

A) We could simply new up a heap variable with no intention of ever deleting it but that is going to make valgrind grumpy and it will also be an issue if we ever support unregistering UDFs (because the function doc wouldn't be deleted at that point) or if we support registering UDFs to custom registries (that may not live for the life of the program).

B) We could create a python "function registry" that proxies to the C++ function registry except it saves off function docs in a map. The "unregister function" operation (if we ever add one) could then remove the function doc from the map. Though we will need to repeat all of this logic when we get around to adding R UDFs and if any user wanted to add their own custom C++ UDFs they may end up repeating the logic too.

C) We could modify Function so that it actually takes ownership of the passed-in doc (via value copy or unique pointer). We do not ever copy function objects, we rarely create them, and the function doc is not accessed inside of any hot loop, so I don't see this having any negative performance impact, but I don't know why they were implemented this way in the first place. At most I think we'd be looking at a few nanoseconds of startup cost to copy the function doc constants when we initialize the registry. We could also use the "optional ownership" pattern if we were concerned about this cost.

My preference would be C but I haven't worked with this code too much so CC @lidavidm / @pitrou for a second opinion.
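
For comparison, option B could look roughly like the following plain-Python sketch. `DocKeepingRegistry` is a hypothetical name, and the `backend` dict merely stands in for the C++ function registry; nothing here is existing Arrow API.

```python
class DocKeepingRegistry:
    """Hypothetical proxy that owns function docs for registered UDFs."""

    def __init__(self, backend):
        self._backend = backend   # stands in for the C++ registry
        self._docs = {}           # name -> doc, kept alive while registered

    def register(self, name, func, doc):
        if name in self._docs:
            raise KeyError(f"function already registered: {name}")
        self._docs[name] = doc
        self._backend[name] = func

    def unregister(self, name):
        # Dropping the doc here is what option A cannot do.
        del self._backend[name]
        del self._docs[name]

    def doc(self, name):
        return self._docs[name]


backend = {}
registry = DocKeepingRegistry(backend)
registry.register("add_one", lambda x: x + 1, {"summary": "Add one"})
print(registry.doc("add_one")["summary"])  # Add one
registry.unregister("add_one")
print("add_one" in backend)  # False
```

The sketch also makes the drawback visible: every binding (Python, R, custom C++) would need its own copy of this bookkeeping, whereas option C keeps it inside Function itself.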

python/pyarrow/tests/test_udf.py

cpp/src/arrow/python/udf.cc

cpp/src/arrow/python/udf.h

lidavidm · 2022-04-01T21:09:01Z

I would probably prefer (C). I would guess that the reason FunctionDoc is stored in Function as a pointer and not a smart pointer is just that it made it easier to declare the documentation as a static.

cpp/src/arrow/python/udf.cc

cpp/src/arrow/compute/function.h

cpp/examples/arrow/udf_example.cc

cpp/src/arrow/python/udf.h

python/pyarrow/tests/test_udf.py

pitrou · 2022-05-02T12:27:28Z

Ok, there's a memory leak and I will also help revamp the tests a bit.

cpp/src/arrow/compute/kernels/scalar_arithmetic.cc

pitrou · 2022-05-02T13:07:58Z

@westonpace Do you want to give this another review?

vibhatha · 2022-05-02T13:17:51Z

@pitrou Thank you for the improvement 👍

westonpace

Thanks for the cleanup @pitrou and @vibhatha. I think this is good now. Thanks for all the effort everyone!

vibhatha · 2022-05-03T08:39:54Z

Thanks a lot for the support @westonpace @lidavidm @pitrou @jorisvandenbossche @amol-

kszucs · 2022-05-03T14:08:35Z

This patch broke several packaging builds: https://app.travis-ci.com/github/ursacomputing/crossbow/builds/250149018#L1829

vibhatha · 2022-05-03T14:16:42Z

I see the error.
@kszucs could you please point me to all the other failed packaging builds?

pitrou · 2022-05-03T14:21:58Z

@kszucs That build compiles with Python 3.6, which we don't support anymore.

kszucs · 2022-05-03T14:24:01Z

> @kszucs That build compiles with Python 3.6, which we don't support anymore.

@kou shall we remove CentOS 7, CentOS 8, AlmaLinux 8 and Ubuntu Bionic builds? Or the python libraries?

@pitrou it could be easier to make the builds compile for now.

kszucs · 2022-05-03T14:25:59Z

> I see the error.
> @kszucs could you please point me to all the other failed packaging builds?

@vibhatha sure, here is the relevant nightly builds report

vibhatha · 2022-05-03T14:47:14Z

@pitrou @kszucs

It seems most of the builds are broken because of this:

../src/arrow/python/udf.cc:43:9: error: '_Py_IsFinalizing' was not declared in this scope
     if (_Py_IsFinalizing()) {
         ^~~~~~~~~~~~~~~~
../src/arrow/python/udf.cc:43:9: note: suggested alternative: '_Py_Finalizing'
     if (_Py_IsFinalizing()) {
         ^~~~~~~~~~~~~~~~
         _Py_Finalizing

Should we apply the suggested alternative for now?

What is the best approach?

pitrou · 2022-05-03T14:48:52Z

> @pitrou it could be easier to make the builds compile for now.

Perhaps, but they're not supposed to work, so...

vibhatha · 2022-05-03T14:50:28Z

> > @pitrou it could be easier to make the builds compile for now.
>
> Perhaps, but they're not supposed to work, so...

I see…

kou · 2022-05-03T21:41:38Z

> @kou shall we remove CentOS 7, CentOS 8, AlmaLinux 8 and Ubuntu Bionic builds? Or the python libraries?

We can just disable the Python module in Arrow C++ only for packages that still use Python 3.6.

vibhatha · 2022-05-04T04:09:39Z

I'm curious what the long-term, stable fix for this issue is.
@kou @pitrou @westonpace @kszucs

I think we relabeled this feature for 9.0.0.

pitrou · 2022-05-04T07:29:58Z

> We can just disable the Python module in Arrow C++ only for packages that still use Python 3.6.

We probably also want to set the minimal Python version to 3.7 in the CMake configuration, instead of letting a compile error appear later.

kou · 2022-05-04T21:02:02Z

@vibhatha Could you open a Jira issue for this? I can work on this.

vibhatha · 2022-05-05T00:39:13Z

@kou Thank you for the support. I created the JIRA: https://issues.apache.org/jira/browse/ARROW-16474

kou · 2022-05-05T20:09:55Z

Thanks!

ursabot · 2022-05-06T13:51:09Z

Benchmark runs are scheduled for baseline = 7809c6d and contender = 7a0f00c. 7a0f00c is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Conbench compare runs links:
[Finished ⬇️0.0% ⬆️0.0%] ec2-t3-xlarge-us-east-2
[Failed ⬇️1.52% ⬆️0.27%] test-mac-arm
[Finished ⬇️4.64% ⬆️0.0%] ursa-i9-9960x
[Finished ⬇️1.22% ⬆️0.28%] ursa-thinkcentre-m75q
Buildkite builds:
[Finished] 7a0f00c1 ec2-t3-xlarge-us-east-2
[Failed] 7a0f00c1 test-mac-arm
[Finished] 7a0f00c1 ursa-i9-9960x
[Finished] 7a0f00c1 ursa-thinkcentre-m75q
[Finished] 7809c6d7 ec2-t3-xlarge-us-east-2
[Finished] 7809c6d7 test-mac-arm
[Finished] 7809c6d7 ursa-i9-9960x
[Finished] 7809c6d7 ursa-thinkcentre-m75q
Supported benchmarks:
ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
test-mac-arm: Supported benchmark langs: C++, Python, R
ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java

ursabot · 2022-05-06T13:51:18Z

['Python', 'R'] benchmarks have high level of regressions.
ursa-i9-9960x

vibhatha force-pushed the arrow-15639 branch from 7f3ef72 to 77d83df Compare March 9, 2022 13:25

github-actions bot added Component: C++ Component: Python labels Mar 9, 2022

vibhatha force-pushed the arrow-15639 branch from 3ca2970 to 8418e88 Compare March 14, 2022 05:44

vibhatha changed the title ~~ARROW-15639 [C++][Python] UDF Scalar Function Implementation [WIP]~~ ARROW-15639 [C++][Python] UDF Scalar Function Implementation Mar 14, 2022

vibhatha marked this pull request as ready for review March 14, 2022 07:12

westonpace requested review from amol-, jorisvandenbossche and westonpace March 14, 2022 20:08

vibhatha force-pushed the arrow-15639 branch from ee08c5c to e29841b Compare March 20, 2022 17:50

westonpace requested changes Mar 22, 2022

vibhatha force-pushed the arrow-15639 branch 2 times, most recently from eb3d20f to d2e616a Compare March 28, 2022 13:50

pitrou requested changes Mar 29, 2022

westonpace requested changes Apr 1, 2022

lidavidm reviewed Apr 1, 2022

cpp/src/arrow/python/udf.cc (outdated)

vibhatha requested review from lidavidm, pitrou and westonpace April 4, 2022 03:44

lidavidm reviewed Apr 4, 2022

vibhatha added 3 commits April 30, 2022 06:37

addressing reviews

8425e57

avoid a copying input types

eef3896

addressing move issue

a51a6a0

vibhatha requested a review from westonpace April 30, 2022 02:31

lidavidm approved these changes May 2, 2022

Fix Python reference leaks and improve tests

706087a

pitrou approved these changes May 2, 2022

cpp/src/arrow/compute/kernels/scalar_arithmetic.cc (outdated)

Fix subtract_checked doc

08880e7

pitrou closed this in 7a0f00c May 3, 2022

westonpace reviewed May 3, 2022
