
Conversation

@vibhatha
Contributor

This PR adds experimental documentation for Python Scalar UDFs. At the moment
we only support Scalar UDFs, and the code snippets show how to use
UDFs with PyArrow.

@vibhatha
Contributor Author

cc @westonpace @lidavidm @pitrou

@github-actions

⚠️ Ticket has not been started in JIRA, please click 'Start Progress'.

@vibhatha
Contributor Author

cc @amol- @jorisvandenbossche

Member

@pitrou pitrou left a comment

This is a good start, but see comments below.
Also, can you add API docs for register_scalar_function and ScalarUdfContext?

>>> func_args = [pc.scalar(5), ds.field("total_amount($)"), pc.scalar(2)]
>>> result_table = dataset.to_table(
... columns={
... 'total_amount_projected($)': ds.field('')._call(function_name, func_args),
Member

What is function_name here? Perhaps it would be better to spell it explicitly.

Member

Also @jorisvandenbossche do you remember why Expression._call has a leading underscore?

Contributor Author

function_name has been replaced with "affine".

@vibhatha
Contributor Author

Also, can you add API docs for register_scalar_function and ScalarUdfContext?

Do you mean include it in the Python docs? I am not sure how to link it properly, but the docstrings are there for both ScalarUdfContext and register_scalar_function where they are defined.

@vibhatha
Contributor Author

@pitrou could you please help me with adding API docs? or explain how to do that?

@lidavidm
Member

@vibhatha add docstrings to the Python/Cython code: https://numpydoc.readthedocs.io/en/latest/format.html (see the example on the bottom, or look through the Arrow source)
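For anyone following along, a minimal numpydoc-style skeleton might look like this. The signature mirrors register_scalar_function, but treat the parameter descriptions as an illustrative sketch rather than the authoritative Arrow docstring:

```python
def register_scalar_function(func, function_name, function_doc, in_types,
                             out_type):
    """
    Register a user-defined scalar function.

    Illustrative numpydoc skeleton only; the authoritative docstring
    lives in pyarrow's compute module.

    Parameters
    ----------
    func : callable
        The user-defined function, called as ``func(ctx, *args)``.
    function_name : str
        The name under which the function is registered.
    function_doc : dict
        A dictionary with "summary" and "description" entries.
    in_types : dict
        Maps each argument name to its input DataType.
    out_type : DataType
        The output type of the function.

    Examples
    --------
    >>> pc.register_scalar_function(my_udf, "my_udf", docs, in_types, out_type)
    """
```

The key conventions are the underlined section headers ("Parameters", "Examples") and the `name : type` parameter lines, which numpydoc parses into the rendered API page.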

@pitrou
Member

pitrou commented Jul 22, 2022

@vibhatha add docstrings to the Python/Cython code: https://numpydoc.readthedocs.io/en/latest/format.html (see the example on the bottom, or look through the Arrow source)

And also reference the given symbols in https://github.com/apache/arrow/blob/master/docs/source/python/api/compute.rst

Member

@pitrou pitrou left a comment

@vibhatha Can I ask you to re-read your changes before pushing them? This would help minimize the review cycles. There are several redundancies and oddities here.

Comment on lines 457 to 461
>>> def affine_with_python(ctx, m, x, c):
... m = m[0].as_py()
... x = x[0].as_py()
... c = c[0].as_py()
... return pa.array([m * x + c])
Member

Why is this taking m[0]? I don't understand what this example is supposed to show...

Contributor Author

I assumed you asked me to show the all-scalar scenario in an example rather than in wording. That’s why I added one to showcase it.

Contributor Author

I referred to this: #13687 (comment)

Contributor Author

Did I misinterpret your statement?

Member

Well, the all-scalar scenario works with the original UDF, no?

Contributor Author

@pitrou should we keep this example or just keep a note/warning? I am inclined towards the example, though.

Member

If we are going to keep this example then we should:

  • Change the above ..note:: so that it isn't a note but a dedicated subsection.
  • Move the paragraph starting with "More generally," to come after the example.
  • Add a paragraph motivating the example.

However, I think we should expand this subsection so it isn't "Here is a weird case that happens when all inputs are scalar" but rather "Your UDF should generally be capable of handling different combinations of scalar- and array-shaped inputs".
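To make the "handle any combination" guidance concrete, here is a pure-Python sketch of the normalization pattern. The names (affine_any_shape, as_sequence) are made up for illustration; a real UDF would operate on PyArrow scalars and arrays rather than lists:

```python
def as_sequence(val, length):
    """Normalize a scalar-or-sequence argument to a list of `length` items."""
    if isinstance(val, (list, tuple)):
        if len(val) != length:
            raise ValueError("length mismatch")
        return list(val)
    return [val] * length  # broadcast a scalar across the batch

def affine_any_shape(m, x, c):
    # Pick the batch length from whichever argument is a sequence;
    # default to 1 when all inputs are scalars.
    length = max((len(v) for v in (m, x, c) if isinstance(v, (list, tuple))),
                 default=1)
    m, x, c = (as_sequence(v, length) for v in (m, x, c))
    return [mi * xi + ci for mi, xi, ci in zip(m, x, c)]

print(affine_any_shape(10.0, 5.0, 1.0))          # [51.0]
print(affine_any_shape(10.0, [5.0, 6.0], 1.0))   # [51.0, 61.0]
```

Normalizing every argument up front keeps the actual computation a single code path regardless of which inputs arrived as scalars.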

Contributor Author

I re-organized and included content. Please check it.

Member

I'm sorry, but the example still makes no sense to me. Why not use the regular affine function here, instead of this scalar-specific definition?

Member

It is interesting, I think, that a person can define a UDF without using the Arrow compute functions at all, that is the most compelling point of the UDF feature in my mind since compositions of Arrow compute functions could already be done using expressions.

However, it is not clear from the description that this is the purpose of this example (is it?). It's also perhaps not the most motivating example since it can be expressed as an Arrow expression.

@vibhatha
Contributor Author

@vibhatha add docstrings to the Python/Cython code: https://numpydoc.readthedocs.io/en/latest/format.html (see the example on the bottom, or look through the Arrow source)

@lidavidm I have already included the docs in the Cython code in the original PR, but that function is in compute.pyx and is only referred to as an import in compute.py. I was not sure whether to make the Cython method an underscore-prefixed function, call it from compute.py, and add the docs there. That was the confusing part.

@vibhatha
Contributor Author

@vibhatha Can I ask you to re-read your changes before pushing them? This would help minimize the review cycles. There are several redundancies and oddities here.

Of course, sorry about the issues. Will make sure to avoid this.

@westonpace westonpace self-requested a review July 25, 2022 16:54
@vibhatha vibhatha requested a review from pitrou July 25, 2022 16:58
Member

@westonpace westonpace left a comment

Regarding ScalarUdfContext and register_scalar_function: both seem to have the correct docstrings in the Python code (to the best of my knowledge). I think the problem is what @pitrou alluded to: you need to include both in some kind of /api/*.rst file (and /api/compute.rst makes the most sense).

For example: 757ec32

These /api/*.rst files are an index into the API documentation. If a type or method is not referenced by one of these files then API documentation will not be autogenerated for it.
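In case it helps, an index entry of that kind might look roughly like this in docs/source/python/api/compute.rst (a sketch only; the exact toctree options and section placement should match what the file already uses):

```rst
.. autosummary::
   :toctree: ../generated/

   register_scalar_function
   ScalarUdfContext
```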


@vibhatha
Contributor Author

@westonpace thanks for the detailed review. I will address these.

------------------

PyArrow UDFs accept input types of both scalar and array. Also it can have
vivid combinations of these types. It is important that the UDF author must make sure,
Member

Is "vivid" the right word here? @westonpace What do you think?

Member

I think any would be better.


@vibhatha
Contributor Author

vibhatha commented Aug 1, 2022

Re: #13687 (comment)

@pitrou we could use the other affine function, but doesn't that defeat the purpose?
I thought we decided to elaborate a single sentence with an example. Maybe I misunderstood your point.

@vibhatha vibhatha requested review from pitrou and westonpace August 1, 2022 14:29
@pitrou
Member

pitrou commented Aug 1, 2022

The purpose is to show how a regular scalar function can get executed on all-scalar inputs with help from the compute function execution layer.
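A toy pure-Python model of the help that execution layer provides (call_scalar_function here is a made-up stand-in, not Arrow's actual machinery): it promotes all-scalar inputs to length-1 arrays, invokes the array-oriented UDF, and unwraps the result.

```python
def affine(ctx, m, x, c):
    # Array-oriented UDF: expects list inputs of equal length.
    return [mi * xi + ci for mi, xi, ci in zip(m, x, c)]

def call_scalar_function(udf, args):
    """Toy stand-in for the compute layer's all-scalar handling."""
    if all(not isinstance(a, list) for a in args):
        # Promote each scalar to a length-1 array, run the UDF,
        # then unwrap the single-element result back to a scalar.
        result = udf(None, *[[a] for a in args])
        return result[0]
    return udf(None, *args)

print(call_scalar_function(affine, [10.0, 5.0, 1.0]))  # 51.0
```

This is why a UDF written purely against array inputs can still be invoked with all-scalar arguments: the layer does the promotion, not the UDF author.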

@pitrou
Member

pitrou commented Aug 1, 2022

We can also remove this entire example (the one showing execution on all scalars), because I'm not sure how useful it actually is.

@vibhatha
Contributor Author

vibhatha commented Aug 1, 2022

The purpose is to show how a regular scalar function can get executed on all-scalar inputs with help from the compute function execution layer.

Yes @pitrou, that's the most important part. But shouldn't we also point out how it could be used with a third-party library function? Say the user has already written a custom function in an old Python script that doesn't use the PyArrow compute API. I thought about that angle; that's why I added the note. Is this something we don't need to discuss now and should maybe leave for a different venue? Please correct me if I am wrong.

cc @westonpace Any thoughts?



Comment on lines 474 to 479
>>> pc.call_function(function_name, [pa.scalar(10.1), pa.scalar(10.2), pa.scalar(20.2)])
<pyarrow.DoubleScalar: 123.22>
Member

The function correctly handles the all-scalar case but it does not handle other cases. Ideally, an example should demonstrate how to write a UDF that can handle all possible cases. For example:

print('10.0 * 5.0 + 1.0 should be 51.0')
print(f'Answer={pc.call_function(function_name, [pa.scalar(10.0), pa.scalar(5.0), pa.scalar(1.0)])}')

print('[10.0, 10.0] * [5.0, 6.0] + [1.0, 1.0] should be [51.0, 61.0]')
print(f'Answer={pc.call_function(function_name, [pa.array([10.0, 10.0]), pa.array([5.0, 6.0]), pa.array([1.0, 1.0])])}')

print('10.0 * [5.0, 6.0] + 1.0 should be [51.0, 61.0]')
print(f'Answer={pc.call_function(function_name, [pa.scalar(10.0), pa.array([5.0, 6.0]), pa.scalar(1.0)])}')

Right now, the function as designed gives me this output:

10.0 * 5.0 + 1.0 should be 51.0
Answer=51.0
[10.0, 10.0] * [5.0, 6.0] + [1.0, 1.0] should be [51.0, 61.0]
Answer=[
  51
]
10.0 * [5.0, 6.0] + 1.0 should be [51.0, 61.0]
Traceback (most recent call last):
  File "/home/pace/experiments/arrow-17181/repr.py", line 39, in <module>
    print(f'Answer={pc.call_function(function_name, [pa.scalar(10.0), pa.array([5.0, 6.0]), pa.scalar(1.0)])}')
  File "pyarrow/_compute.pyx", line 560, in pyarrow._compute.call_function
  File "pyarrow/_compute.pyx", line 355, in pyarrow._compute.Function.call
  File "pyarrow/error.pxi", line 144, in pyarrow.lib.pyarrow_internal_check_status
  File "pyarrow/_compute.pyx", line 2506, in pyarrow._compute._scalar_udf_callback
  File "/home/pace/experiments/arrow-17181/repr.py", line 21, in affine_with_python
    m = m[0].as_py()
TypeError: 'pyarrow.lib.DoubleScalar' object is not subscriptable

>>> pc.call_function(function_name, [pa.scalar(10.1), pa.scalar(10.2), pa.scalar(20.2)])
<pyarrow.DoubleScalar: 123.22>

Note that here the the final output is returned as an array. Depending the usage of vivid libraries
Member

Is it an array? I see:

<pyarrow.DoubleScalar: 123.22>

It probably should be an array.

>>> pc.call_function(function_name, [pa.scalar(10.1), pa.scalar(10.2), pa.scalar(20.2)])
<pyarrow.DoubleScalar: 123.22>

Note that here the the final output is returned as an array. Depending the usage of vivid libraries
Member

I'm not sure what you are trying to say with the sentence that starts "Depending the..."

Comment on lines 516 to 519
Here note that the `ds.field('')_call()` returns an expression. The passed arguments
to this function call are expressions not scalar values
(i.e `pc.scalar(5.2), ds.field("value"), pc.scalar(2.1)`). This expression is evaluated
when the project operator uses this expression.
Member

You say "The passed arguments to this function call are expressions not scalar values".

However, pc.scalar(5.2) and pc.scalar(2.1) look like scalar values. I'm not sure a user will recognize the subtle difference between pc.scalar(5.2) and pa.scalar(5.2) without further explanation.
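The distinction can be modeled in a few lines of plain Python (a loose analogy only; these classes are invented for illustration and are not PyArrow's implementation): pa.scalar produces a concrete value right away, while pc.scalar produces a deferred expression node that is only evaluated when the engine supplies data.

```python
class ScalarExpr:
    """Stands in for pc.scalar: a literal node in an expression tree."""
    def __init__(self, v):
        self.v = v
    def evaluate(self, batch):
        # A literal ignores the batch; other nodes look up columns in it.
        return self.v

class FieldExpr:
    """Stands in for ds.field: resolved against a batch at execution time."""
    def __init__(self, name):
        self.name = name
    def evaluate(self, batch):
        return batch[self.name]

class CallExpr:
    """Stands in for a deferred function call in the expression tree."""
    def __init__(self, fn, args):
        self.fn, self.args = fn, args
    def evaluate(self, batch):
        return self.fn(*(a.evaluate(batch) for a in self.args))

# affine(m, x, c) = m * x + c, deferred until .evaluate() is given data
expr = CallExpr(lambda m, x, c: m * x + c,
                [ScalarExpr(5), FieldExpr("value"), ScalarExpr(2)])
print(expr.evaluate({"value": 10}))  # 52
```

Nothing is computed when `expr` is built; the literal 5 only participates in arithmetic once a batch reaches `evaluate`, which is the sense in which pc.scalar is an expression rather than a value.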

@westonpace
Member

I think part of the challenge with this documentation is that implementing affine in pure-python is not a very compelling use case. I think the more interesting case for UDFs is when we want to use some other library that does efficient compute and is capable of working with Arrow data. For example, numpy. Here is an example that exposes numpy's gcd function (greatest common divisor) as an Arrow function:

import numpy as np

import pyarrow as pa
import pyarrow.compute as pc

function_name = "numpy_gcd"
function_docs = {
       "summary": "Calculates the greatest common divisor",
       "description":
           "Given 'x' and 'y' find the greatest number that divides\n"
           "evenly into both x and y."
}

input_types = {
   "x" : pa.int64(),
   "y" : pa.int64()
}

output_type = pa.int64()

def to_np(val):
    if isinstance(val, pa.Scalar):
        return val.as_py()
    else:
        return np.array(val)

def gcd_numpy(ctx, x, y):
    np_x = to_np(x)
    np_y = to_np(y)
    return pa.array(np.gcd(np_x, np_y))

pc.register_scalar_function(gcd_numpy,
                            function_name,
                            function_docs,
                            input_types,
                            output_type)

print('gcd(27, 63) should be 9')
print(f'Answer={pc.call_function(function_name, [pa.scalar(27), pa.scalar(63)])}')
print()
print('gcd([27, 18], [54, 63]) should be [27, 9]')
print(f'Answer={pc.call_function(function_name, [pa.array([27, 18]), pa.array([54, 63])])}')
print()
print('gcd(27, [54, 18]) should be [27, 9]')
print(f'Answer={pc.call_function(function_name, [pa.scalar(27), pa.array([54, 18])])}')

Notice the use of the helper function to_np to convert from inputs of different shapes to ensure that we get something that numpy can work with.
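The reason a single to_np helper suffices is that np.gcd, like other NumPy ufuncs, broadcasts a plain Python scalar against an array, so once a pyarrow Scalar has been unwrapped with as_py() the mixed scalar/array case comes for free. A quick check, assuming numpy is installed:

```python
import numpy as np

# np.gcd broadcasts a plain Python int against an array, which is why
# converting a pyarrow Scalar to a Python value (val.as_py()) is enough
# for the mixed scalar/array case in the UDF above.
print(np.gcd(27, 63))                      # 9
print(list(np.gcd([27, 18], [54, 63])))    # [27, 9]
print(list(np.gcd(27, [54, 18])))          # [27, 9]
```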

@vibhatha
Contributor Author

I think part of the challenge with this documentation is that implementing affine in pure-python is not a very compelling use case. [...]

I see your point. I will update the example to use this.

@vibhatha
Contributor Author

@jorisvandenbossche WDYT about this: #13687 (comment)

Member

@pitrou pitrou left a comment

Ok, I think this is good enough now :-)
Thank you @vibhatha !

@pitrou pitrou merged commit 902781d into apache:master Sep 29, 2022
@vibhatha
Contributor Author

Thanks everyone for the reviews 🙂

@ursabot

ursabot commented Sep 29, 2022

Benchmark runs are scheduled for baseline = 60c9383 and contender = 902781d. 902781d is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Conbench compare runs links:
[Finished ⬇️0.0% ⬆️0.0%] ec2-t3-xlarge-us-east-2
[Failed ⬇️0.82% ⬆️0.0%] test-mac-arm
[Failed ⬇️0.0% ⬆️0.0%] ursa-i9-9960x
[Finished ⬇️0.18% ⬆️0.07%] ursa-thinkcentre-m75q
Buildkite builds:
[Finished] 902781d1 ec2-t3-xlarge-us-east-2
[Finished] 902781d1 test-mac-arm
[Failed] 902781d1 ursa-i9-9960x
[Finished] 902781d1 ursa-thinkcentre-m75q
[Finished] 60c93833 ec2-t3-xlarge-us-east-2
[Failed] 60c93833 test-mac-arm
[Failed] 60c93833 ursa-i9-9960x
[Finished] 60c93833 ursa-thinkcentre-m75q
Supported benchmarks:
ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
test-mac-arm: Supported benchmark langs: C++, Python, R
ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java

7 participants