
Conversation

@pitrou pitrou commented Jan 4, 2022

Add docstrings for parameters of the various function options classes.
Automatically ingest those parameter docstrings into the generated compute function docstrings
(this uses vendored code from numpydoc).

Example with the `min_max` parameter docs:

* before:

```
array : Array-like
    Argument to compute function
skip_nulls : optional
    Parameter for ScalarAggregateOptions constructor. Either `options`
    or `skip_nulls` can be passed, but not both at the same time.
min_count : optional
    Parameter for ScalarAggregateOptions constructor. Either `options`
    or `min_count` can be passed, but not both at the same time.
options : pyarrow.compute.ScalarAggregateOptions, optional
    Parameters altering compute function semantics.
memory_pool : pyarrow.MemoryPool, optional
    If not passed, will allocate memory from the default memory pool.
```

* after:

```
array : Array-like
    Argument to compute function.
skip_nulls : bool, default True
    Whether to skip (ignore) nulls in the input.
    If False, any null in the input forces the output to null.
min_count : int, default 1
    Minimum number of non-null values in the input.  If the number
    of non-null values is below `min_count`, the output is null.
options : pyarrow.compute.ScalarAggregateOptions, optional
    Alternative way of passing options.
memory_pool : pyarrow.MemoryPool, optional
    If not passed, will allocate memory from the default memory pool.
```
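The ingestion step can be sketched roughly as follows. This is an illustrative hand-rolled parser with hypothetical names (`parse_param_docs` is not from the PR); the actual implementation vendors numpydoc's parser instead:

```python
import textwrap

def parse_param_docs(docstring):
    """Map each documented parameter name to its full numpydoc entry.

    A tiny subset of what numpydoc's real parser does.
    """
    params, name, entry = {}, None, []
    in_section = False
    for line in textwrap.dedent(docstring).strip().splitlines():
        if line.strip() == "Parameters":
            in_section = True
        elif in_section and line.strip() and set(line.strip()) == {"-"}:
            pass  # the "----------" underline of the section header
        elif in_section and line and not line[0].isspace() and " : " in line:
            # A new "name : type" header starts a parameter entry.
            if name is not None:
                params[name] = "\n".join(entry)
            name, entry = line.split(" : ")[0].strip(), [line]
        elif name is not None:
            entry.append(line)  # indented description line
    if name is not None:
        params[name] = "\n".join(entry)
    return params

# Docstring of a simplified stand-in for an options class:
options_doc = """
Parameters
----------
skip_nulls : bool, default True
    Whether to skip (ignore) nulls in the input.
min_count : int, default 1
    Minimum number of non-null values in the input.
"""

# Splice the per-option entries into a generated compute function docstring:
entries = parse_param_docs(options_doc)
func_doc = "Parameters\n----------\n" + "\n".join(
    entries[n] for n in ("skip_nulls", "min_count"))
```

The point of the design is that each option is documented once, on the options class, and every compute function that accepts it inherits the same text.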


pitrou commented Jan 4, 2022

cc @amol-


github-actions bot commented Jan 4, 2022

@pitrou pitrou changed the title ARROW-10317: [Python] Document compute function options. ARROW-10317: [Python] Document compute function options Jan 4, 2022
@pitrou pitrou force-pushed the ARROW-10317-py-doc-function-options branch 2 times, most recently from 90e92cb to b2e0374 Compare January 10, 2022 14:24

pitrou commented Jan 11, 2022

Ping @jorisvandenbossche

It would be nice for this PR to be reviewed soon because it needs rebasing and updating every time a new compute function is added.

@pitrou pitrou force-pushed the ARROW-10317-py-doc-function-options branch from b2e0374 to 6647879 Compare January 11, 2022 13:38
Member

Are we sure there is a strong guarantee that the documented options skip_nulls and min_count will always match those of ScalarAggregateOptions? They match now, but I'm concerned that in six months we will have forgotten that these docstrings feed multiple classes, and might accidentally add options that don't exist in the classes that reuse the existing docstrings.

Member Author

I'm not sure what you mean? skip_nulls and min_count are individual options, not functions.

Member

@jorisvandenbossche jorisvandenbossche Jan 13, 2022

You mean whether they will stay the same for ScalarAggregateOptions vs ElementWiseAggregateOptions vs ... (the different places it is being used)?

It seems quite unlikely to me that the general description will no longer be correct for one of those functions, and we can always update it if that changes (also, if we inlined them all as duplicates, we would still need to remember to update the docstrings whenever the behaviour changes in C++ ..)

Member

@amol- amol- Jan 13, 2022

Yes, I mean that there might come a point where the various places where they are used are no longer aligned.
I'm not concerned about the current state of things (the functions report the docstrings of individual options, etc.); I'm mostly concerned that a year from now we will have forgotten how this was designed, and someone might think it reasonable, for example, to add one more option in _skip_nulls_doc(), causing a wrong option to be documented on one of the Options classes using that function.

I guess one possible solution would be to have something inspect the class and confirm that the arguments of __init__ match those documented, and throw an error when they don't. At least developers would be informed when they break something.
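Such a consistency check could look roughly like this. This is a minimal sketch with a simplified stand-in for `ScalarAggregateOptions`; pyarrow's actual validation goes through `archery numpydoc`, not this helper:

```python
import inspect

def check_documented_params(cls):
    """Raise TypeError if the parameters documented in cls.__doc__ don't
    match the arguments accepted by cls.__init__.

    Hypothetical sketch; a real check would reuse a proper numpydoc parser.
    """
    documented = set()
    for line in inspect.cleandoc(cls.__doc__ or "").splitlines():
        # After cleandoc, "name : type" headers sit at column 0 while
        # their descriptions stay indented.
        if " : " in line and not line[0].isspace():
            documented.add(line.split(" : ")[0].strip())
    accepted = {p for p in inspect.signature(cls.__init__).parameters
                if p != "self"}
    if documented != accepted:
        raise TypeError(
            f"{cls.__name__}: documented {sorted(documented)} "
            f"but __init__ accepts {sorted(accepted)}")

class ScalarAggregateOptions:
    """Options for scalar aggregations (simplified stand-in).

    Parameters
    ----------
    skip_nulls : bool, default True
        Whether to skip (ignore) nulls in the input.
    min_count : int, default 1
        Minimum number of non-null values in the input.
    """
    def __init__(self, skip_nulls=True, min_count=1):
        self.skip_nulls = skip_nulls
        self.min_count = min_count

check_documented_params(ScalarAggregateOptions)  # passes silently
```

Run at import time or in a test, this would surface a mismatch as soon as someone adds an option to the shared docstring without adding it to the class (or vice versa).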

Member Author

@pitrou pitrou Jan 13, 2022

someone might think that it's reasonable for example to add one more option in _skip_nulls_doc() causing a wrong option to be added to one of the Option classes using that function.

Hmm, I don't understand. Why would someone add an option there?
The entire purpose of _skip_nulls_doc() is to factor the documentation for a single option skip_nulls. It's not documenting a hypothetical function named "skip_nulls" to which people would add other options.
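The pattern under discussion: a helper like `_skip_nulls_doc()` holds the one shared description of the option, and every place that accepts the parameter splices it in. A rough illustrative sketch (the helper name comes from the thread; the body here is a stand-in, not the PR's actual code):

```python
def _skip_nulls_doc():
    # The single, shared description of the `skip_nulls` option,
    # reused by every Options class that accepts this parameter.
    return """\
skip_nulls : bool, default True
    Whether to skip (ignore) nulls in the input.
    If False, any null in the input forces the output to null.
"""

# Spliced verbatim into a generated docstring:
generated = "Parameters\n----------\n" + _skip_nulls_doc()
```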

Member

@jorisvandenbossche jorisvandenbossche Jan 13, 2022

I guess one possible solution would be to have something inspecting the class and confirming the arguments in __init__ match with those documented.

We have a numpydoc validation check that should already do something like that? (I don't remember whether it runs by default, or whether it covers the compute module.)

But that would also not really solve your concern that the "meaning" of a keyword might change for one of the functions (checking that the signature keywords match the documented keywords doesn't guarantee that the description content is correct).

Member

Hmm, I don't understand. Why would someone add an option there?

I suppose that Alessandro means something like the following: assume that at some point we add an additional accepted value to skip_nulls, and document it in the skip_nulls_doc() function. But that new value might only be accepted by a subset of the compute kernels that have a skip_nulls argument, and thus we would get an incorrect docstring for the others.

Now, since skip_nulls is currently a boolean keyword (True/False), it seems unlikely that a new accepted value will be added.
And again, if that happens, I think we can handle it at that time; I think we will remember that this skip_nulls_doc() is used in several places, and thus verify that our changes are correct for all the places where it is used.

Member

You are right, the numpydoc check would catch a set of issues. I'm wondering btw if it's actually running given that it didn't catch https://github.com/apache/arrow/blob/master/python/pyarrow/array.pxi#L714

Member

I'm wondering btw if it's actually running given that it didn't catch https://github.com/apache/arrow/blob/master/python/pyarrow/array.pxi#L714

I think the check for proper whitespace around the colon is not yet enabled. For now, #7732 only enabled check "PR01", which verifies that all parameters are present in the docstring.

Member

On further inspection, it seems that archery numpydoc doesn't check class methods, opened https://issues.apache.org/jira/browse/ARROW-15321

Member

@jorisvandenbossche jorisvandenbossche left a comment

This is really nice!

Comment on lines +734 to +735
Member

Suggested change

```
- Generated values are uniformly-distributed, double-precision """ +
- """in range [0, 1).
+ Generated values are uniformly-distributed, double-precision \
+ in range [0, 1).
```

This is an alternative way to do it: it doesn't need to end/start the triple quotes, but because of the strange indentation it is not necessarily nicer ..


@pitrou pitrou force-pushed the ARROW-10317-py-doc-function-options branch from 6647879 to 69b44e2 Compare January 13, 2022 11:11
@pitrou pitrou force-pushed the ARROW-10317-py-doc-function-options branch from 0be4a14 to c6beb8c Compare January 13, 2022 12:35
@pitrou pitrou closed this in 0c5cd73 Jan 13, 2022
@pitrou pitrou deleted the ARROW-10317-py-doc-function-options branch January 13, 2022 15:52

ursabot commented Jan 13, 2022

Benchmark runs are scheduled for baseline = 111347d and contender = 0c5cd73. 0c5cd73 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Conbench compare runs links:
[Finished ⬇️0.0% ⬆️0.0%] ec2-t3-xlarge-us-east-2
[Failed ⬇️0.45% ⬆️0.0%] ursa-i9-9960x
[Finished ⬇️0.35% ⬆️0.0%] ursa-thinkcentre-m75q
Supported benchmarks:
ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python. Runs only benchmarks with cloud = True
ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java


4 participants