-
Notifications
You must be signed in to change notification settings - Fork 4k
ARROW-17181: [Docs][Python] Scalar UDF Experimental Documentation #13687
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
|
pitrou
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a good start, but see comments below.
Also, can you add API docs for register_scalar_function and ScalarUdfContext?
docs/source/python/compute.rst
Outdated
| >>> func_args = [pc.scalar(5), ds.field("total_amount($)"), pc.scalar(2)] | ||
| >>> result_table = dataset.to_table( | ||
| ... columns={ | ||
| ... 'total_amount_projected($)': ds.field('')._call(function_name, func_args), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What is function_name here? Perhaps it would be better to spell it explicitly.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also @jorisvandenbossche do you remember why Expression._call has a leading underscore?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
function_name replaced with "affine"
Do you mean include it in docs for Python? I am not sure how to link it properly. But the text is there for both |
|
@pitrou could you please help me with adding API docs? or explain how to do that? |
|
@vibhatha add docstrings to the Python/Cython code: https://numpydoc.readthedocs.io/en/latest/format.html (see the example on the bottom, or look through the Arrow source) |
And also reference the given symbols in https://github.com/apache/arrow/blob/master/docs/source/python/api/compute.rst |
pitrou
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@vibhatha Can I ask you to re-read your changes before pushing them? This would help minimize the review cycles. There are several redundancies and oddities here.
docs/source/python/compute.rst
Outdated
| >>> def affine_with_python(ctx, m, x, c): | ||
| ... m = m[0].as_py() | ||
| ... x = x[0].as_py() | ||
| ... c = c[0].as_py() | ||
| ... return pa.array([m * x + c]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why is this taking m[0]? I don't understand what this example is supposed to show...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I assumed you asked me to show the all scalar scenario in an example rather than wording. That’s why I added one to showcase that.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I referred to this: #13687 (comment)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
DId I misinterpret your statement?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Well, the all-scalar scenario works with the original UDF? no?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@pitrou should we keep this example or just keep a note/warning? I am inclined towards the example, though.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we are going to keep this example then we should:
- Change the above
..note::so that it isn't a note but a dedicated subsection. - Move the paragraph starting with "More generally," to come after the example.
- Add a paragraph motivating the example.
However, I think we should expand this subsection so it isn't "Here is a wierd case that happens when all inputs are scalar" to "Your UDF should generally be capable of handling different combinations of scalar and array shaped inputs"
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I re-organized and included content. Please check it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm sorry, but the example still makes no sense to me. Why not use the regular affine function here, instead of this scalar-specific definition?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is interesting, I think, that a person can define a UDF without using the Arrow compute functions at all, that is the most compelling point of the UDF feature in my mind since compositions of Arrow compute functions could already be done using expressions.
However, it is not clear from the description that this is the purpose of this example (is it?). It's also perhaps not the most motivating example since it can be expressed as an Arrow expression.
@lidavidm I have already included the docs in the Cython in the original PR, but that function is in compute.pyx and only referred as an import in the compute.py. I was not sure whether to mark the method in Cython underscore function and call it in the compute.py and add the docs there. That was the confusing part. |
Of course, sorry about the issues. Will make sure to avoid this. |
westonpace
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Regarding ScalarUdfContext and register_scalar_function: Both seem to have the correct docstrings in the python code (to the best of my knowledge). I think the problem what @pitrou alluded to. You need to include both in some kind of /api/*.rst file (and /pi/compute.rst does make the most sense).
For example: 757ec32
These /api/*.rst files are an index into the API documentation. If a type or method is not referenced by one of these files then API documentation will not be autogenerated for it.
docs/source/python/compute.rst
Outdated
| >>> def affine_with_python(ctx, m, x, c): | ||
| ... m = m[0].as_py() | ||
| ... x = x[0].as_py() | ||
| ... c = c[0].as_py() | ||
| ... return pa.array([m * x + c]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we are going to keep this example then we should:
- Change the above
..note::so that it isn't a note but a dedicated subsection. - Move the paragraph starting with "More generally," to come after the example.
- Add a paragraph motivating the example.
However, I think we should expand this subsection so it isn't "Here is a wierd case that happens when all inputs are scalar" to "Your UDF should generally be capable of handling different combinations of scalar and array shaped inputs"
|
@westonpace thanks for the detailed review. I will address these. |
docs/source/python/compute.rst
Outdated
| ------------------ | ||
|
|
||
| PyArrow UDFs accept input types of both scalar and array. Also it can have | ||
| vivid combinations of these types. It is important that the UDF author must make sure, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is "vivid" the right word here? @westonpace What do you think?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think any would be better.
docs/source/python/compute.rst
Outdated
| >>> def affine_with_python(ctx, m, x, c): | ||
| ... m = m[0].as_py() | ||
| ... x = x[0].as_py() | ||
| ... c = c[0].as_py() | ||
| ... return pa.array([m * x + c]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm sorry, but the example still makes no sense to me. Why not use the regular affine function here, instead of this scalar-specific definition?
|
Re: #13687 (comment) @pitrou we can use the other |
|
The purpose is to show how a regular scalar function can get executed on all-scalar inputs with help from the compute function execution layer. |
|
We can also remove this entire example (the one showing execution on all scalars), because I'm not sure how useful it actually is. |
Yes @pitrou that's the most important part. But shouldn't we point out how it could be used with a third-party library function. Let's say the user has already written a custom function which is an old python script, it doesn't use PyArrow compute API. I thought about that angel, that's why added the note. Is this something we don't need to discuss now and may be leave it for a different venue to discuss? Please correct me if I am wrong. cc @westonpace Any thoughts? |
docs/source/python/compute.rst
Outdated
| ------------------ | ||
|
|
||
| PyArrow UDFs accept input types of both scalar and array. Also it can have | ||
| vivid combinations of these types. It is important that the UDF author must make sure, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think any would be better.
docs/source/python/compute.rst
Outdated
| >>> def affine_with_python(ctx, m, x, c): | ||
| ... m = m[0].as_py() | ||
| ... x = x[0].as_py() | ||
| ... c = c[0].as_py() | ||
| ... return pa.array([m * x + c]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is interesting, I think, that a person can define a UDF without using the Arrow compute functions at all, that is the most compelling point of the UDF feature in my mind since compositions of Arrow compute functions could already be done using expressions.
However, it is not clear from the description that this is the purpose of this example (is it?). It's also perhaps not the most motivating example since it can be expressed as an Arrow expression.
docs/source/python/compute.rst
Outdated
| >>> pc.call_function(function_name, [pa.scalar(10.1), pa.scalar(10.2), pa.scalar(20.2)]) | ||
| <pyarrow.DoubleScalar: 123.22> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The function correctly handles the all-scalar case but it does not handle other cases. Ideally, an example should demonstrate how to write a UDF that can handle all possible cases. For example:
print('10.0 * 5.0 + 1.0 should be 51.0')
print(f'Answer={pc.call_function(function_name, [pa.scalar(10.0), pa.scalar(5.0), pa.scalar(1.0)])}')
print('[10.0, 10.0] * [5.0, 6.0] + [1.0, 1.0] should be [51.0, 61.0]')
print(f'Answer={pc.call_function(function_name, [pa.array([10.0, 10.0]), pa.array([5.0, 6.0]), pa.array([1.0, 1.0])])}')
print('10.0 * [5.0, 6.0] + 1.0 should be [51.0, 61.0]')
print(f'Answer={pc.call_function(function_name, [pa.scalar(10.0), pa.array([5.0, 6.0]), pa.scalar(1.0)])}')
Right now, the function as it is designed, gives me this output:
10.0 * 5.0 + 1.0 should be 51.0
Answer=51.0
[10.0, 10.0] * [5.0, 6.0] + [1.0, 1.0] should be [51.0, 61.0]
Answer=[
51
]
10.0 * [5.0, 6.0] + 1.0 should be [51.0, 61.0]
Traceback (most recent call last):
File "/home/pace/experiments/arrow-17181/repr.py", line 39, in <module>
print(f'Answer={pc.call_function(function_name, [pa.scalar(10.0), pa.array([5.0, 6.0]), pa.scalar(1.0)])}')
File "pyarrow/_compute.pyx", line 560, in pyarrow._compute.call_function
File "pyarrow/_compute.pyx", line 355, in pyarrow._compute.Function.call
File "pyarrow/error.pxi", line 144, in pyarrow.lib.pyarrow_internal_check_status
File "pyarrow/_compute.pyx", line 2506, in pyarrow._compute._scalar_udf_callback
File "/home/pace/experiments/arrow-17181/repr.py", line 21, in affine_with_python
m = m[0].as_py()
TypeError: 'pyarrow.lib.DoubleScalar' object is not subscriptable
docs/source/python/compute.rst
Outdated
| >>> pc.call_function(function_name, [pa.scalar(10.1), pa.scalar(10.2), pa.scalar(20.2)]) | ||
| <pyarrow.DoubleScalar: 123.22> | ||
|
|
||
| Note that here the the final output is returned as an array. Depending the usage of vivid libraries |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it an array? I see:
<pyarrow.DoubleScalar: 123.22>
It probably should be an array.
docs/source/python/compute.rst
Outdated
| >>> pc.call_function(function_name, [pa.scalar(10.1), pa.scalar(10.2), pa.scalar(20.2)]) | ||
| <pyarrow.DoubleScalar: 123.22> | ||
|
|
||
| Note that here the the final output is returned as an array. Depending the usage of vivid libraries |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure what you are trying to say with the sentence that starts "Depending the..."
docs/source/python/compute.rst
Outdated
| Here note that the `ds.field('')_call()` returns an expression. The passed arguments | ||
| to this function call are expressions not scalar values | ||
| (i.e `pc.scalar(5.2), ds.field("value"), pc.scalar(2.1)`). This expression is evaluated | ||
| when the project operator uses this expression. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You say "The passed arguments to this function call are expressions not scalar values".
However, pc.scalar(5.2) and pc.scalar(2.1) look like scalar values. I'm not sure a user will recognize the subtle difference between pc.scalar(5.2) and pa.scalar(5.2) without further explanation.
|
I think part of the challenge with this documentation is that implementing Notice the use of the helper function |
I see your point. I will update the example to use this. |
Co-authored-by: Antoine Pitrou <pitrou@free.fr>
Co-authored-by: Antoine Pitrou <pitrou@free.fr>
Co-authored-by: Antoine Pitrou <pitrou@free.fr>
Co-authored-by: Antoine Pitrou <pitrou@free.fr>
Co-authored-by: Antoine Pitrou <pitrou@free.fr>
Co-authored-by: Antoine Pitrou <pitrou@free.fr>
Co-authored-by: Antoine Pitrou <pitrou@free.fr>
Co-authored-by: Antoine Pitrou <pitrou@free.fr>
Co-authored-by: Antoine Pitrou <pitrou@free.fr>
7f47cdf to
227db70
Compare
|
@jorisvandenbossche WDYT about this: #13687 (comment) |
pitrou
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, I think this is good enough now :-)
Thank you @vibhatha !
|
Thanks everyone for the reviews 🙂 |
|
Benchmark runs are scheduled for baseline = 60c9383 and contender = 902781d. 902781d is a master commit associated with this PR. Results will be available as each benchmark for each run completes. |
This PR contains Python Scalar UDF documentation as an experimental version of docs.
At the moment we only support Scalar UDFs and the code snippets include how to use
UDFs with PyArrow.