@vibhatha
Contributor

@vibhatha vibhatha commented Mar 9, 2022

PR for Scalar UDF integration

This is the first phase of UDF integration into Arrow. This version only includes ScalarFunctions.
In future PRs, Vector UDFs (using Arrow VectorFunction), UDTFs (user-defined table functions),
and Aggregation UDFs will be integrated. This PR includes the following:

  • UDF Python Scalar Function registration and usage
  • UDF Python Scalar Function Examples
  • UDF Python Scalar Function test cases
  • UDF C++ Example extended from Compute Function Example
  • Added aggregation example (optional for this PR; if required, it can be removed and pushed in a separate PR)

@github-actions

github-actions bot commented Mar 9, 2022

⚠️ Ticket has no components in JIRA, make sure you assign one.

@github-actions

github-actions bot commented Mar 9, 2022

⚠️ Ticket has not been started in JIRA, please click 'Start Progress'.

@vibhatha vibhatha changed the title ARROW-15639 [C++][Python] UDF Scalar Function Implementation [WIP] ARROW-15639 [C++][Python] UDF Scalar Function Implementation Mar 14, 2022
@vibhatha vibhatha marked this pull request as ready for review March 14, 2022 07:12
Member

@westonpace westonpace left a comment

This is a great start. I'm excited for this feature. Thanks for taking this on.

@vibhatha
Contributor Author

@westonpace thank you for the detailed review. I will work on the suggested changes. It is exciting to write this new feature. Appreciate your support.

@vibhatha vibhatha force-pushed the arrow-15639 branch 2 times, most recently from eb3d20f to d2e616a Compare March 28, 2022 13:50
Member

@pitrou pitrou left a comment

Ok, I've started to look at this. Here are some comments but there'll probably be more when the biggest issues are tackled :-)

return wrap_input_type(c_input_type)


cdef class Arity(_Weakrefable):
Member

This doesn't seem resolved to me. I agree this custom class is completely overkill. Also, it doesn't match how this information is currently exposed:

>>> import pyarrow as pa, pyarrow.compute as pc
>>> pc.get_function("random")
arrow.compute.Function<name=random, kind=scalar, arity=0, num_kernels=1>
>>> pc.get_function("sort_indices")
arrow.compute.Function<name=sort_indices, kind=meta, arity=1, num_kernels=0>
>>> pc.get_function("add")
arrow.compute.Function<name=add, kind=scalar, arity=2, num_kernels=33>
>>> pc.get_function("if_else")
arrow.compute.Function<name=if_else, kind=scalar, arity=3, num_kernels=39>
>>> pc.get_function("choose")
arrow.compute.Function<name=choose, kind=scalar, arity=Ellipsis, num_kernels=32>

(note the varargs convention of returning Ellipsis is perhaps not enough if we want to faithfully mirror the C++ function metadata, but that can be a followup PR)
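
To illustrate the parenthetical above: the C++ `Arity` metadata carries both an argument count and a varargs flag, so collapsing it to a bare `Ellipsis` drops the minimum argument count of a varargs function. The following is only a sketch; `ArityInfo` and `as_python_arity` are hypothetical names used for illustration and are not part of pyarrow, and the minimum counts below are assumed values.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ArityInfo:
    """Hypothetical stand-in for the C++ arity metadata."""
    num_args: int            # exact count, or the minimum count for varargs
    is_varargs: bool = False


def as_python_arity(arity):
    """Collapse the metadata to the int-or-Ellipsis convention shown above."""
    return Ellipsis if arity.is_varargs else arity.num_args


add = ArityInfo(num_args=2)
# Assumed minimum argument counts, chosen only to show the loss of information.
varargs_min_two = ArityInfo(num_args=2, is_varargs=True)
varargs_min_one = ArityInfo(num_args=1, is_varargs=True)

print(as_python_arity(add))              # 2
print(as_python_arity(varargs_min_two))  # Ellipsis
# Two different varargs arities become indistinguishable once collapsed:
print(as_python_arity(varargs_min_two) == as_python_arity(varargs_min_one))  # True
```

If the minimum count matters to callers, a follow-up would need to expose it separately rather than rely on `Ellipsis` alone.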

@vibhatha
Contributor Author

@pitrou Thanks a lot for the very descriptive review. I will address these issues.

@vibhatha
Contributor Author

@pitrou regarding this: #12590 (comment)

Ah I see your point. But a few things about this.

Aren't we going to use the options feature to take in UDF-specific options? The attributes listed in this suggestion are core to any UDF. I haven't thought deeply about how to expose dynamic options to Python yet; that's why it was empty in the first place. What do you think about this?

cc @westonpace

@westonpace
Member

westonpace commented Mar 31, 2022

@vibhatha I'm not entirely clear on the concern. I don't mind if we have an inheritance tree of object structs. For example, this would be ok (this is totally made up and likely is not correct for aggregate UDFs at all):

struct UdfOptions {
  cp::Function::Kind kind;
  cp::Arity arity;
  const cp::FunctionDoc func_doc;
  std::vector<cp::InputType> in_types;
  cp::OutputType out_type;
  cp::MemAllocation::type mem_allocation;
  cp::NullHandling::type null_handling;
};

struct ScalarUdfOptions : public UdfOptions {
  std::string func_name;
};

struct AggregateUdfOptions : public UdfOptions {
  std::string init_func_name;
  std::string map_func_name;
  std::string reduce_func_name;
};

Status MakeScalarFunction(PyObject* function, const ScalarUdfOptions& options);
Status MakeAggregateFunction(PyObject* function, const AggregateUdfOptions& options);

@vibhatha
Contributor Author

@westonpace This makes sense to me. What if we need dynamic content, where we have to attach values from Python to this struct? I haven't explored this deeply, which is why I mentioned that part is yet to be implemented. I am not 100% sure we need such an option; I have yet to verify its necessity.

@westonpace
Member

This is for function registration so I don't think there will be any dynamic content here. For example, we currently have "scalar" and we know we will need "aggregate". There may be other classes of UDF that we add but each time we do so it will be intentional and accompanied by a code change. I don't think we're looking into dynamically adding new classes of UDFs.

Function options & state are a slightly different story. We do want to support dynamic function option content. However, from the C++ perspective, both of these things will just be PyObject*. For example, maybe a user defines a custom datetime formatting UDF and they want to take the format pattern and locale in as objects. I think we could do it this way...

class UdfOptions(object):
  def __init__(self):
    pass

class CustomDateFormatOptions(UdfOptions):
  def __init__(self, format, locale):
    self.format = format
    self.locale = locale

Function registration remains unchanged. Later, when they call their UDF we would do something like...

pc.call_function("custom_date_format", [timestamps_arr], CustomDateFormatOptions(my_format, my_locale))

Then the python layer could check to see if the options object extends UdfOptions and, if it does, pass it to some kind of CUdfOptions:

struct UdfOptions {
  PyObject* options_obj;
};

...then we'd probably need some logic when we are actually calling the PyFunc to grab the PyObject out of the options so that hopefully the UDF itself could look something like...

def custom_date_format(arr, options):
  ...

In summary, I think a flat struct should be sufficient for any cases we need to tackle.
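
The flow described above can be sketched end to end in plain Python. This is only an illustration under stated assumptions: `register_udf`, `call_udf`, and `_REGISTRY` are hypothetical stand-ins for the function registry and for `pc.call_function`, and plain lists of `datetime` objects stand in for Arrow arrays.

```python
from datetime import datetime


class UdfOptions:
    """Base class that marks an object as UDF options."""


class CustomDateFormatOptions(UdfOptions):
    def __init__(self, format, locale):
        self.format = format
        self.locale = locale  # carried along but unused in this sketch


# Hypothetical stand-in for the function registry: name -> Python callable.
_REGISTRY = {}


def register_udf(name, func):
    """Registration takes no options, matching the discussion above."""
    _REGISTRY[name] = func


def call_udf(name, args, options=None):
    """Forward an opaque options object to the UDF, if one was given."""
    if options is not None and not isinstance(options, UdfOptions):
        raise TypeError("options must extend UdfOptions")
    func = _REGISTRY[name]
    return func(*args) if options is None else func(*args, options)


def custom_date_format(values, options):
    # The UDF reads whatever attributes its own options class defines.
    return [v.strftime(options.format) for v in values]


register_udf("custom_date_format", custom_date_format)
opts = CustomDateFormatOptions("%Y-%m-%d", locale="C")
result = call_udf("custom_date_format", [[datetime(2022, 3, 31)]], opts)
print(result)  # ['2022-03-31']
```

The dispatch layer never inspects the options beyond the `isinstance` check, which is the point of treating them as an opaque object on the C++ side.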

Member

@westonpace westonpace left a comment

I think we might have a bit of a problem with the function doc. Currently, Function does not take ownership of FunctionDoc in any way. That means that some external force has to make sure the lifetime of FunctionDoc outlives the lifetime of the registered Function. We have gotten away with this so far because all of our function docs are constants with static storage and thus live forever.

That will not work for python.

A) We could simply new up a heap variable with no intention of ever deleting it but that is going to make valgrind grumpy and it will also be an issue if we ever support unregistering UDFs (because the function doc wouldn't be deleted at that point) or if we support registering UDFs to custom registries (that may not live for the life of the program).

B) We could create a python "function registry" that proxies to the C++ function registry except it saves off function docs in a map. The "unregister function" operation (if we ever add one) could then remove the function doc from the map. Though we will need to repeat all of this logic when we get around to adding R UDFs and if any user wanted to add their own custom C++ UDFs they may end up repeating the logic too.

C) We could modify Function so that it actually takes ownership of the passed-in doc (via value copy or unique pointer). We do not ever copy function objects, we rarely create them, and the function doc is not accessed inside of any hot loop, so I don't see this having any negative performance impact, but I don't know why they were implemented in this way in the first place. At most I think we'd be looking at a few nanoseconds of startup cost to copy the function doc constants when we initialize the registry. We could also use the "optional ownership" pattern if we were concerned about this cost.

My preference would be C but I haven't worked with this code too much so CC @lidavidm / @pitrou for a second opinion.
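
The lifetime concern behind options B and C can be illustrated with a small Python analogue, where a strong reference plays the role of ownership. This is a sketch only: `FunctionDoc` and `DocOwningRegistry` here are hypothetical illustration classes, not the Arrow types, and in C++ the equivalent would be a value copy or a smart pointer rather than a Python reference.

```python
import gc
import weakref


class FunctionDoc:
    """Hypothetical stand-in for a function's documentation object."""

    def __init__(self, summary, description=""):
        self.summary = summary
        self.description = description


class DocOwningRegistry:
    """Keeps each doc alive for exactly as long as its function is registered."""

    def __init__(self):
        self._functions = {}
        self._docs = {}

    def register(self, name, func, doc):
        if name in self._functions:
            raise KeyError(f"function {name!r} is already registered")
        self._functions[name] = func
        self._docs[name] = doc  # the registry now holds a strong reference

    def unregister(self, name):
        del self._functions[name]
        del self._docs[name]  # releases the doc together with the function

    def doc(self, name):
        return self._docs[name]


registry = DocOwningRegistry()
doc = FunctionDoc("Add one to every element")
doc_ref = weakref.ref(doc)  # lets us observe the doc's lifetime
registry.register("add_one", lambda xs: [x + 1 for x in xs], doc)

del doc  # the caller's reference goes away, the registry's does not
gc.collect()
alive_while_registered = doc_ref() is not None

registry.unregister("add_one")
gc.collect()
alive_after_unregister = doc_ref() is not None

print(alive_while_registered, alive_after_unregister)  # True False
```

Either way, the caller no longer has to guarantee that the doc outlives the registered function.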

@lidavidm
Member

lidavidm commented Apr 1, 2022

I would probably prefer (C). I would guess that the reason why FunctionDoc is stored in Function as a pointer and not a smart pointer is just that it made it easier to declare the documentation as a static.

@vibhatha vibhatha requested a review from westonpace April 30, 2022 02:31
@pitrou
Member

pitrou commented May 2, 2022

Ok, there's a memory leak and I will also help revamp the tests a bit.

@pitrou
Member

pitrou commented May 2, 2022

@westonpace Do you want to give this another review?

@vibhatha
Contributor Author

vibhatha commented May 2, 2022

@pitrou Thank you for the improvement 👍

@pitrou pitrou closed this in 7a0f00c May 3, 2022
Member

@westonpace westonpace left a comment

Thanks for the cleanup @pitrou and @vibhatha . I think this is good now. Thanks for all the effort everyone!

@vibhatha
Contributor Author

vibhatha commented May 3, 2022

Thanks a lot for the support @westonpace @lidavidm @pitrou @jorisvandenbossche @amol-

@kszucs
Member

kszucs commented May 3, 2022

This patch broke several packaging builds: https://app.travis-ci.com/github/ursacomputing/crossbow/builds/250149018#L1829

@vibhatha
Contributor Author

vibhatha commented May 3, 2022

I see the error.
@kszucs could you please point me to all the other failed packaging builds?

@pitrou
Member

pitrou commented May 3, 2022

@kszucs That build compiles with Python 3.6, which we don't support anymore.

@kszucs
Member

kszucs commented May 3, 2022

@kszucs That build compiles with Python 3.6, which we don't support anymore.

@kou shall we remove CentOS 7, CentOS 8, AlmaLinux 8 and Ubuntu Bionic builds? Or the python libraries?

@pitrou it could be easier to make the builds compile for now.

@kszucs
Member

kszucs commented May 3, 2022

I see the error.
@kszucs could you please point me to all the other failed packaging builds?

@vibhatha sure, here is the relevant nightly builds report

@vibhatha
Contributor Author

vibhatha commented May 3, 2022

@pitrou @kszucs

it seems most of the builds are broken because of this,

../src/arrow/python/udf.cc:43:9: error: '_Py_IsFinalizing' was not declared in this scope
     if (_Py_IsFinalizing()) {
         ^~~~~~~~~~~~~~~~
../src/arrow/python/udf.cc:43:9: note: suggested alternative: '_Py_Finalizing'
     if (_Py_IsFinalizing()) {
         ^~~~~~~~~~~~~~~~
         _Py_Finalizing

Should we apply the suggested alternative for now?

What’s the best approach?

@pitrou
Member

pitrou commented May 3, 2022

@pitrou it could be easier to make the builds compile for now.

Perhaps, but they're not supposed to work, so...

@vibhatha
Contributor Author

vibhatha commented May 3, 2022

@pitrou it could be easier to make the builds compile for now.

Perhaps, but they're not supposed to work, so...

I see…

@kou
Member

kou commented May 3, 2022

@kou shall we remove CentOS 7, CentOS 8, AlmaLinux 8 and Ubuntu Bionic builds? Or the python libraries?

We can just disable the Python module in Arrow C++ only for packages that still use Python 3.6.

@vibhatha
Contributor Author

vibhatha commented May 4, 2022

I'm curious what the long-term, stable fix for this issue is.
@kou @pitrou @westonpace @kszucs

I think we relabeled this feature for 9.0.0.

@pitrou
Member

pitrou commented May 4, 2022

We can just disable the Python module in Arrow C++ only for packages that still use Python 3.6.

We probably also want to set the minimal Python version to 3.7 in the CMake configuration, instead of letting a compile error appear later.
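
The "fail early" idea can be sketched in Python as an up-front version check. This is an analogue only, not the CMake change itself; `check_python_version` and `MIN_PYTHON` are hypothetical names introduced for illustration.

```python
import sys

MIN_PYTHON = (3, 7)  # minimum version mentioned in the comment above


def check_python_version(version=None, minimum=MIN_PYTHON):
    """Raise a clear error up front if the interpreter is older than `minimum`."""
    if version is None:
        version = sys.version_info
    version = tuple(version[:2])
    if version < minimum:
        raise RuntimeError(
            "Python {}.{} or newer is required, found {}.{}".format(*minimum, *version)
        )
    return version


try:
    check_python_version((3, 6))
except RuntimeError as exc:
    print(exc)  # Python 3.7 or newer is required, found 3.6
```

A check like this turns an obscure compile error into a message that names the actual requirement.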

@kou
Member

kou commented May 4, 2022

@vibhatha Could you open a Jira issue for this? I can work on this.

@vibhatha
Contributor Author

vibhatha commented May 5, 2022

@kou Thank you for the support. I created the JIRA: https://issues.apache.org/jira/browse/ARROW-16474

@kou
Member

kou commented May 5, 2022

Thanks!

@ursabot

ursabot commented May 6, 2022

Benchmark runs are scheduled for baseline = 7809c6d and contender = 7a0f00c. 7a0f00c is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Conbench compare runs links:
[Finished ⬇️0.0% ⬆️0.0%] ec2-t3-xlarge-us-east-2
[Failed ⬇️1.52% ⬆️0.27%] test-mac-arm
[Finished ⬇️4.64% ⬆️0.0%] ursa-i9-9960x
[Finished ⬇️1.22% ⬆️0.28%] ursa-thinkcentre-m75q
Buildkite builds:
[Finished] 7a0f00c1 ec2-t3-xlarge-us-east-2
[Failed] 7a0f00c1 test-mac-arm
[Finished] 7a0f00c1 ursa-i9-9960x
[Finished] 7a0f00c1 ursa-thinkcentre-m75q
[Finished] 7809c6d7 ec2-t3-xlarge-us-east-2
[Finished] 7809c6d7 test-mac-arm
[Finished] 7809c6d7 ursa-i9-9960x
[Finished] 7809c6d7 ursa-thinkcentre-m75q
Supported benchmarks:
ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
test-mac-arm: Supported benchmark langs: C++, Python, R
ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java

@ursabot

ursabot commented May 6, 2022

['Python', 'R'] benchmarks have high level of regressions.
ursa-i9-9960x
