ARROW-10317: [Python] Document compute function options #12076
Conversation
cc @amol-
Ping @jorisvandenbossche. It would be nice for this PR to be reviewed soon, because it needs rebasing and updating every time a new compute function is added.
python/pyarrow/_compute.pyx (outdated)
Are we sure there is a strong guarantee that the options of skip_nulls and min_count will always forever match with those of ScalarAggregate? I mean, now they match, but I'm concerned that in 6 months we will have forgotten that the docstrings influence multiple classes and might accidentally add options that don't exist in the classes that reuse the existing docstrings.
I'm not sure what you mean? skip_nulls and min_count are individual options, not functions.
You mean whether they will stay the same for ScalarAggregateOptions vs ElementWiseAggregateOptions vs ... (the different places it is being used)?
It seems quite unlikely to me that the general description will stop being correct for one of those functions, but we can always update it then if that changes (also, if we inline them all as duplicates, we still need to think about updating the docstrings if the behaviour in C++ changes ..)
Yes, I mean that there might be a point where the various places where they are used are not aligned anymore.
I'm not concerned about the current state of things (the functions report the docstrings of individual options, etc.). I'm mostly concerned that a year from now we will have forgotten how we designed this, and someone might think it's reasonable, for example, to add one more option in `_skip_nulls_doc()`, causing a wrong option to be added to one of the Options classes using that function.
I guess one possible solution would be to have something inspect the class and confirm that the arguments in `__init__` match those documented, throwing an error when they don't. At least developers would be informed when they break something.
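A minimal sketch of what such a check could look like, assuming the documented parameters can be scraped out of a numpydoc-style "Parameters" section (the helper below is hypothetical, not existing Arrow code):

```python
import inspect
import re

def check_options_docstring(cls):
    """Fail loudly if the parameters documented in cls.__doc__ do not
    match the keyword arguments accepted by cls.__init__."""
    # Crude parse: pick up "name : type" entries at the start of a line
    # in a numpydoc-style "Parameters" section.
    documented = set(re.findall(r"^\s*(\w+) :", cls.__doc__ or "", re.M))
    signature = inspect.signature(cls.__init__)
    accepted = {name for name in signature.parameters if name != "self"}
    if documented != accepted:
        raise TypeError(
            f"{cls.__name__}: documented parameters {sorted(documented)} "
            f"do not match __init__ arguments {sorted(accepted)}"
        )
```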
> someone might think that it's reasonable for example to add one more option in `_skip_nulls_doc()` causing a wrong option to be added to one of the Options classes using that function.
Hmm, I don't understand. Why would someone add an option there?
The entire purpose of `_skip_nulls_doc()` is to factor the documentation for a single option, `skip_nulls`. It's not documenting a hypothetical function named "skip_nulls" to which people would add other options.
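For context, the factoring pattern under discussion looks roughly like this (a simplified Python sketch, not the actual code in `_compute.pyx`):

```python
def _skip_nulls_doc():
    # Documents the single `skip_nulls` option; the text is reused
    # verbatim by every Options class that accepts this keyword.
    return """skip_nulls : bool, default True
        Whether to skip (ignore) nulls in the input.
        If False, any null in the input forces the output to null.
    """

class ScalarAggregateOptions:
    __doc__ = ("Options for scalar aggregations.\n\n"
               "Parameters\n"
               "----------\n"
               + _skip_nulls_doc())
```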
> I guess one possible solution would be to have something inspecting the class and confirming the arguments in `__init__` match with those documented.
We have a numpydoc validation check that should already do something like that? (I don't remember whether it runs by default, or whether it runs for the compute module.)
But that would also not really solve your concern that the "meaning" of the keyword could change for one of the functions? (Checking that the signature keywords match the documented keywords doesn't guarantee anything about whether the description content is correct.)
> Hmm, I don't understand. Why would someone add an option there?
I suppose that Alessandro means something like the following: assume that at some point we add an additional accepted value to skip_nulls, and document it in the skip_nulls_doc() function. But that new value might only be accepted by a subset of the compute kernels that have a skip_nulls argument, and thus we would get an incorrect docstring for the others.
Now, since `skip_nulls` is currently a boolean keyword (True/False), it seems unlikely that a new accepted value will be added.
And again, if that happens, I think we can handle it at that time; I think we will remember that this `skip_nulls_doc()` is used in several places, and thus verify that our changes are correct for all the places where it is used.
You are right, the numpydoc check would catch a set of issues. I'm wondering btw if it's actually running given that it didn't catch https://github.com/apache/arrow/blob/master/python/pyarrow/array.pxi#L714
> I'm wondering btw if it's actually running given that it didn't catch https://github.com/apache/arrow/blob/master/python/pyarrow/array.pxi#L714
I think the check for proper whitespace around the colon is not yet enabled. For now, #7732 only enabled check "PR01", which checks whether all parameters are present in the docstring.
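(For reference, numpydoc's validation machinery can also be run by hand on a single object; a rough sketch, where the target name is just an example:)

```python
from numpydoc.validate import validate

# `validate` takes the fully-qualified name of an object and returns a
# report dict; "errors" is a list of (code, message) tuples.
report = validate("pyarrow.Array.to_numpy")
missing = [msg for code, msg in report["errors"] if code == "PR01"]
print(missing)  # parameters absent from the docstring, if any
```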
On further inspection, it seems that `archery numpydoc` doesn't check class methods; opened https://issues.apache.org/jira/browse/ARROW-15321.
jorisvandenbossche left a comment:
This is really nice!
python/pyarrow/tests/test_compute.py (outdated)
Suggested change:
```
- Generated values are uniformly-distributed, double-precision """ +
- """in range [0, 1).
+ Generated values are uniformly-distributed, double-precision \
+ in range [0, 1).
```
This is an alternative way to do it: it doesn't need to end/start the triple quotes, but because of the strange indentation it's not necessarily nicer ..
Add docstrings for parameters of the various function options classes.
Automatically ingest those parameter docstrings into the generated compute function docstrings
(this uses vendored code from `numpydoc`).
Example with the `min_max` parameter docs:
* before:
```
array : Array-like
Argument to compute function
skip_nulls : optional
Parameter for ScalarAggregateOptions constructor. Either `options`
or `skip_nulls` can be passed, but not both at the same time.
min_count : optional
Parameter for ScalarAggregateOptions constructor. Either `options`
or `min_count` can be passed, but not both at the same time.
options : pyarrow.compute.ScalarAggregateOptions, optional
Parameters altering compute function semantics.
memory_pool : pyarrow.MemoryPool, optional
If not passed, will allocate memory from the default memory pool.
```
* after:
```
array : Array-like
Argument to compute function.
skip_nulls : bool, default True
Whether to skip (ignore) nulls in the input.
If False, any null in the input forces the output to null.
min_count : int, default 1
Minimum number of non-null values in the input. If the number
of non-null values is below `min_count`, the output is null.
options : pyarrow.compute.ScalarAggregateOptions, optional
Alternative way of passing options.
memory_pool : pyarrow.MemoryPool, optional
If not passed, will allocate memory from the default memory pool.
```
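The ingestion step is, in essence, the following (a simplified sketch on top of numpydoc's public `docscrape` module; the PR itself vendors the numpydoc code into pyarrow, and `ingest_option_docs` is a made-up name):

```python
from numpydoc.docscrape import NumpyDocString

def ingest_option_docs(func_doc, options_doc):
    """Swap the placeholder parameter descriptions in a compute function's
    docstring for the richer ones documented on its Options class."""
    func = NumpyDocString(func_doc)
    # Index the Options class parameters (name, type, description) by name.
    option_params = {p.name: p
                     for p in NumpyDocString(options_doc)["Parameters"]}
    # Keep function-specific parameters (`array`, `memory_pool`, ...) as-is,
    # substituting the documented option wherever the names match.
    func["Parameters"] = [option_params.get(p.name, p)
                          for p in func["Parameters"]]
    return str(func)
```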
Benchmark runs are scheduled for baseline = 111347d and contender = 0c5cd73. 0c5cd73 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.