
Conversation

@pitrou pitrou commented Jan 4, 2022

Add docstrings for parameters of the various function options classes.
Automatically ingest those parameter docstrings into the generated compute function docstrings
(this uses vendored code from numpydoc).

Example with the `min_max` parameter docs:

* before:

```
array : Array-like
    Argument to compute function
skip_nulls : optional
    Parameter for ScalarAggregateOptions constructor. Either `options`
    or `skip_nulls` can be passed, but not both at the same time.
min_count : optional
    Parameter for ScalarAggregateOptions constructor. Either `options`
    or `min_count` can be passed, but not both at the same time.
options : pyarrow.compute.ScalarAggregateOptions, optional
    Parameters altering compute function semantics.
memory_pool : pyarrow.MemoryPool, optional
    If not passed, will allocate memory from the default memory pool.
```

* after:

```
array : Array-like
    Argument to compute function.
skip_nulls : bool, default True
    Whether to skip (ignore) nulls in the input.
    If False, any null in the input forces the output to null.
min_count : int, default 1
    Minimum number of non-null values in the input.  If the number
    of non-null values is below `min_count`, the output is null.
options : pyarrow.compute.ScalarAggregateOptions, optional
    Alternative way of passing options.
memory_pool : pyarrow.MemoryPool, optional
    If not passed, will allocate memory from the default memory pool.
```
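The ingestion step can be sketched roughly as follows. This is an illustrative hand-rolled parser with hypothetical names (`parse_param_docs` is not from the PR); the actual implementation vendors numpydoc's parser instead:

```python
import textwrap

def parse_param_docs(docstring):
    """Map each documented parameter name to its full numpydoc entry.

    A tiny subset of what numpydoc's real parser does.
    """
    params, name, entry = {}, None, []
    in_section = False
    for line in textwrap.dedent(docstring).strip().splitlines():
        if line.strip() == "Parameters":
            in_section = True
        elif in_section and line.strip() and set(line.strip()) == {"-"}:
            pass  # the "----------" underline of the section header
        elif in_section and line and not line[0].isspace() and " : " in line:
            # A new "name : type" header starts a parameter entry.
            if name is not None:
                params[name] = "\n".join(entry)
            name, entry = line.split(" : ")[0].strip(), [line]
        elif name is not None:
            entry.append(line)  # indented description line
    if name is not None:
        params[name] = "\n".join(entry)
    return params

# Docstring of a simplified stand-in for an options class:
options_doc = """
Parameters
----------
skip_nulls : bool, default True
    Whether to skip (ignore) nulls in the input.
min_count : int, default 1
    Minimum number of non-null values in the input.
"""

# Splice the per-option entries into a generated compute function docstring:
entries = parse_param_docs(options_doc)
func_doc = "Parameters\n----------\n" + "\n".join(
    entries[n] for n in ("skip_nulls", "min_count"))
```

The point of the design is that each option is documented once, on the options class, and every compute function that accepts it inherits the same text.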


pitrou commented Jan 4, 2022

cc @amol-


github-actions bot commented Jan 4, 2022

@pitrou pitrou changed the title ARROW-10317: [Python] Document compute function options. ARROW-10317: [Python] Document compute function options Jan 4, 2022
@pitrou pitrou force-pushed the ARROW-10317-py-doc-function-options branch 2 times, most recently from 90e92cb to b2e0374 Compare January 10, 2022 14:24

pitrou commented Jan 11, 2022

Ping @jorisvandenbossche

It would be nice for this PR to be reviewed soon because it needs rebasing and updating every time a new compute function is added.

@pitrou pitrou force-pushed the ARROW-10317-py-doc-function-options branch from b2e0374 to 6647879 Compare January 11, 2022 13:38
Member

Are we sure there is a strong guarantee that the documented options skip_nulls and min_count will always match those of ScalarAggregateOptions? They match now, but I'm concerned that in six months we will have forgotten that these docstrings feed multiple classes, and might accidentally add options that don't exist in the classes that reuse the existing docstrings.

Member Author

I'm not sure what you mean? skip_nulls and min_count are individual options, not functions.

Member

@jorisvandenbossche jorisvandenbossche Jan 13, 2022

You mean whether they will stay the same for ScalarAggregateOptions vs ElementWiseAggregateOptions vs ... (the different places it is being used)?

It seems quite unlikely to me that the general description will no longer be correct for one of those functions, and we can always update it if that changes (also, if we inlined them all as duplicates, we would still need to remember to update the docstrings whenever the behaviour changes in C++ ..)

Member

@amol- amol- Jan 13, 2022

Yes, I mean that there might come a point where the various places where they are used are no longer aligned.
I'm not concerned about the current state of things (the functions report the docstrings of individual options, etc.); I'm mostly concerned that a year from now we will have forgotten how this was designed, and someone might think it reasonable, for example, to add one more option in _skip_nulls_doc(), causing a wrong option to be documented on one of the Options classes using that function.

I guess one possible solution would be to have something inspect the class and confirm that the arguments of __init__ match those documented, and throw an error when they don't. At least developers would be informed when they break something.
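Such a consistency check could look roughly like this. This is a minimal sketch with a simplified stand-in for `ScalarAggregateOptions`; pyarrow's actual validation goes through `archery numpydoc`, not this helper:

```python
import inspect

def check_documented_params(cls):
    """Raise TypeError if the parameters documented in cls.__doc__ don't
    match the arguments accepted by cls.__init__.

    Hypothetical sketch; a real check would reuse a proper numpydoc parser.
    """
    documented = set()
    for line in inspect.cleandoc(cls.__doc__ or "").splitlines():
        # After cleandoc, "name : type" headers sit at column 0 while
        # their descriptions stay indented.
        if " : " in line and not line[0].isspace():
            documented.add(line.split(" : ")[0].strip())
    accepted = {p for p in inspect.signature(cls.__init__).parameters
                if p != "self"}
    if documented != accepted:
        raise TypeError(
            f"{cls.__name__}: documented {sorted(documented)} "
            f"but __init__ accepts {sorted(accepted)}")

class ScalarAggregateOptions:
    """Options for scalar aggregations (simplified stand-in).

    Parameters
    ----------
    skip_nulls : bool, default True
        Whether to skip (ignore) nulls in the input.
    min_count : int, default 1
        Minimum number of non-null values in the input.
    """
    def __init__(self, skip_nulls=True, min_count=1):
        self.skip_nulls = skip_nulls
        self.min_count = min_count

check_documented_params(ScalarAggregateOptions)  # passes silently
```

Run at import time or in a test, this would surface a mismatch as soon as someone adds an option to the shared docstring without adding it to the class (or vice versa).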

Member Author

@pitrou pitrou Jan 13, 2022

someone might think that it's reasonable for example to add one more option in _skip_nulls_doc() causing a wrong option to be added to one of the Option classes using that function.

Hmm, I don't understand. Why would someone add an option there?
The entire purpose of _skip_nulls_doc() is to factor the documentation for a single option skip_nulls. It's not documenting a hypothetical function named "skip_nulls" to which people would add other options.
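The pattern under discussion: a helper like `_skip_nulls_doc()` holds the one shared description of the option, and every place that accepts the parameter splices it in. A rough illustrative sketch (the helper name comes from the thread; the body here is a stand-in, not the PR's actual code):

```python
def _skip_nulls_doc():
    # The single, shared description of the `skip_nulls` option,
    # reused by every Options class that accepts this parameter.
    return """\
skip_nulls : bool, default True
    Whether to skip (ignore) nulls in the input.
    If False, any null in the input forces the output to null.
"""

# Spliced verbatim into a generated docstring:
generated = "Parameters\n----------\n" + _skip_nulls_doc()
```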

Member

@jorisvandenbossche jorisvandenbossche Jan 13, 2022

I guess one possible solution would be to have something inspecting the class and confirming the arguments in __init__ match with those documented.

We have a numpydoc validation check that should already do something like that? (I don't remember whether it runs by default, or whether it covers the compute module.)

But that would also not really solve your concern that the "meaning" of a keyword might change for one of the functions (checking that the signature keywords match the documented keywords doesn't guarantee that the description content is correct).

Member

Hmm, I don't understand. Why would someone add an option there?

I suppose that Alessandro means something like the following: assume that at some point we add an additional accepted value to skip_nulls, and document it in the skip_nulls_doc() function. But that new value might only be accepted by a subset of the compute kernels that have a skip_nulls argument, and thus we would get an incorrect docstring for the others.

Now, since skip_nulls is currently a boolean keyword (True/False), it seems unlikely that a new accepted value will be added.
And again, if that happens, I think we can handle it at that time; I think we will remember that this skip_nulls_doc() is used in several places, and thus verify that our changes are correct for all the places where it is used.

Member

You are right, the numpydoc check would catch a set of issues. I'm wondering btw if it's actually running given that it didn't catch https://github.com/apache/arrow/blob/master/python/pyarrow/array.pxi#L714

Member

I'm wondering btw if it's actually running given that it didn't catch https://github.com/apache/arrow/blob/master/python/pyarrow/array.pxi#L714

I think the check for proper whitespace around the colon is not yet enabled. For now, #7732 only enabled check "PR01", which verifies that all parameters are present in the docstring.

Member

On further inspection, it seems that archery numpydoc doesn't check class methods, opened https://issues.apache.org/jira/browse/ARROW-15321

Member

@jorisvandenbossche jorisvandenbossche left a comment

This is really nice!

Comment on lines +734 to +735
Member

Suggested change

```
- Generated values are uniformly-distributed, double-precision """ +
- """in range [0, 1).
+ Generated values are uniformly-distributed, double-precision \
+ in range [0, 1).
```

This is an alternative way to do it: it doesn't need to end/start the triple quotes, but because of the strange indentation it is not necessarily nicer ..


@pitrou pitrou force-pushed the ARROW-10317-py-doc-function-options branch from 6647879 to 69b44e2 Compare January 13, 2022 11:11
@pitrou pitrou force-pushed the ARROW-10317-py-doc-function-options branch from 0be4a14 to c6beb8c Compare January 13, 2022 12:35
@pitrou pitrou closed this in 0c5cd73 Jan 13, 2022
@pitrou pitrou deleted the ARROW-10317-py-doc-function-options branch January 13, 2022 15:52

ursabot commented Jan 13, 2022

Benchmark runs are scheduled for baseline = 111347d and contender = 0c5cd73. 0c5cd73 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Conbench compare runs links:
[Finished ⬇️0.0% ⬆️0.0%] ec2-t3-xlarge-us-east-2
[Failed ⬇️0.45% ⬆️0.0%] ursa-i9-9960x
[Finished ⬇️0.35% ⬆️0.0%] ursa-thinkcentre-m75q
Supported benchmarks:
ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python. Runs only benchmarks with cloud = True
ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java


4 participants