Skip to content

Conversation

@krcrouse
Copy link

This PR addresses both the JIRA issue cited (pre-generate pyarrow.compute) and also a dev thread that suggests creating the ability to add in python docs for functions that inherit from the Arrow C++ would greatly improve the readability for python users.

There are still a few things to work out, such as where in the build process to generate the code and whether a version of the generated code should be checked into version control or not, but @pitrou suggested opening the PR to field comments from developers.

Major points:

  • creates python/docs/additions tree where the reStructrued text docs that include the sections to overwrite. Using raw reSt so that code block examples can be tested using doctest - see the README for more verbose details
  • pyarrow.docutils (or maybe should be _docutils) provides functions to processes python/docs/additions and return a data structure of the components per function.
  • python/scripts/generate_sources.py uses pyarrow.docutils and writes out the code for the compute functions in pyarrow/generated/compute.py. All of the logic from the release-branch pyarrow.compute module that dynamically generated the compute functions has been moved to this script.
  • I didn't check the generated file into the repo because I generally do not include generated files that would be generated by the build process should be in source control, but I realize there are other perspectives on this
  • pyarrow.compute now imports from pyarrow.generated.compute for all of the autogenerated compute bindings. Override and custom functions are still defined here.
  • The old pyarrow._compute_docstrings is gone because its purpose is subsumed in the above.
  • I've updated the tests so that they work with the above changes.

@github-actions
Copy link

@github-actions
Copy link

⚠️ Ticket has not been started in JIRA, please click 'Start Progress'.

@krcrouse
Copy link
Author

To save you all from having to check out the branch and generate the code, attached is what the python/pyarrow/generated/compute.py file looks like at present (added as a .txt file because github rejects .py)
compute.py.txt
.

@pitrou
Copy link
Member

pitrou commented Jul 7, 2022

@krcrouse Just a word to tell you that I hadn't a chance to take a look yet, but it's definitely on my TODO list (or may be switched to someone else's :-)). I hope the wait isn't too demotivating.

@jorisvandenbossche
Copy link
Member

jorisvandenbossche commented Jul 7, 2022

@krcrouse thanks a lot for your work on this!

I didn't yet look in detail, but some general comments / questions on the overall appraoch:

  • I think we will probably want to commit the generated code to the repo (and in any case, doing that for now might also make it easier to review (although you already linked the file above as well)).

  • Can you explain a bit more the need for pyarrow.docutils? (we should also make it a private module, or put in eg python/scripts or so?)
    As far as I understand, by using the full ability of docutils, we can override any section of the docstring (eg also overriding the Parameters section), instead of only appending content to the docstring? (eg appending the example of notes section).
    Do we think we will really need that full ability? Or it might in almost all cases be sufficient to just append content? (I think for all cases you currently have additions in this PR, appending would also be fine?) If appending is sufficient, that might keep the generation code simpler?

  • pyarrow.compute now imports from pyarrow.generated.compute

    Small nit: this doesn't need to be a public module, so maybe something like _compute_generated.py instead?

  • Using raw reSt so that code block examples can be tested using doctest

    We now started to tests docstring examples in general in the meantime (see ARROW-16018: [Doc][Python] Run doctests on Python docstring examples (--doctest-modules) #13199, ARROW-16018: [Doc][Python] Run doctests on Python docstring examples (CI job) #13216, ARROW-16709: [Docs][Python] Add how to run doctests to the developer guide #13325), so we can probably test those docstrings that way as well.

@krcrouse
Copy link
Author

@jorisvandenbossche, I'll wait for some more guidance from you all and just respond to your comments inline for the moment.

  • I think we will probably want to commit the generated code to the repo (and in any case, doing that for now might also make it easier to review (although you already linked the file above as well)).

That makes sense and probably helps in the overall structure.

  • Can you explain a bit more the need for pyarrow.docutils?

docutils is used to process the reStructured text appendices so that they can be merged with the autogenerated docs. In this model, I propose using reSt for the function documentation additions because it's established, it's testable, and contributors will at least understand what it is even if they're not proficient in it.

If your question is more about "the need for it as a required module" - If we include the generated files then the docutils requirement only needs to be a build dependency in order to generate the actual code files. It does not need to be included as a requirement for the actual package for end users.

(we should also make it a private module, or put in eg python/scripts or so?)

Agreed - it should be a private module.

As far as I understand, by using the full ability of docutils, we can override any section of the docstring ...
Do we think we will really need that full ability? Or it might in almost all cases be sufficient to just append content?

I think we would want both options. Since the default documentation is pulling from the C++ library code, I think you could browse the current generated documentation and see sections that are not useful and could be entirely overwritten. I also think the hybrid approach of creating default documentation with options to append and/or overwrite is best because it will pull in changes to the core C++ function interface automatically while preserving the manually provided improved pythonic documentation.

Take, for example, the parameter definitions of replace_with_mask - in the context of the still not terribly verbose description, the parameter types are incorrect and the explanation of each parameter is useless ('Argument to compute function."). I don't have the code in front of me at the moment, but having implemented this function in pyarrow 7.0, there are several oddities in how these parameters are handled that should be in the param descriptions and not merely as an appended paragraph at the bottom.

Small nit: this doesn't need to be a public module, so maybe something like _compute_generated.py instead?

  • Using raw reSt so that code block examples can be tested using doctest

Agreed.

@krcrouse krcrouse force-pushed the generate-pyarrow-compute-and-improve-docs branch from 084a1ad to 57f2ea2 Compare July 24, 2022 15:38
@krcrouse
Copy link
Author

Hi, @jorisvandenbossche and @pitrou,

I've pushed new updates to this branch based on the comments, including the fully generated source files in pyarrow/_compute_generated.py

I've resolved updates to the tests and included the new compute functions that have been added since the original creation of the branch.

In line with the movement towards docutils, the following will test all of the examples that get pulled into the pyarrow.compute function documentation: python -m doctest -v docs/additions/compute/*

@krcrouse krcrouse force-pushed the generate-pyarrow-compute-and-improve-docs branch 2 times, most recently from 7c8a3bd to e90c6cf Compare August 5, 2022 13:45
@krcrouse krcrouse force-pushed the generate-pyarrow-compute-and-improve-docs branch from ba2bacd to 8c52f62 Compare August 25, 2022 01:29
@krcrouse
Copy link
Author

Hi @jorisvandenbossche and @pitrou,

I made a number of additions so that the autogenerated _compute.py code conformed with flake8, since it doesn't distinguish between hand written and autogenerated code. I also updated the generated function signatures to be compliant with Python 3.7.

Please let me know how you would like to move forward on this / if you do.

If I'm reading the workflow output correctly, I think the only issue right now is that archery is failing due to the lack of licenses in the ReSTructured text files, and I am not sure how that can be added as I don't know of any standard for adding licenses to ReST (output from workflow failure below)

INFO:archery:Running Docker linter
apache-rat license violation: python/docs/additions/compute/all.rst
apache-rat license violation: python/docs/additions/compute/any.rst
apache-rat license violation: python/docs/additions/compute/count.rst
apache-rat license violation: python/docs/additions/compute/count_distinct.rst
apache-rat license violation: python/docs/additions/compute/filter.rst
apache-rat license violation: python/docs/additions/compute/index.rst
apache-rat license violation: python/docs/additions/compute/indices_nonzero.rst
apache-rat license violation: python/docs/additions/compute/mode.rst
apache-rat license violation: python/docs/additions/test_example.rst

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This module is only needed for the automatic generation, right? So in case we check in the generated file, this is only needed for developers, and can maybe be moved outside of pyarrow? (we don't need to ship this in the packages) For example also in python/scripts ?

@jorisvandenbossche
Copy link
Member

@krcrouse thanks for the updates! And sorry for the slow follow-up on your responses.

As far as I understand, by using the full ability of docutils, we can override any section of the docstring ...
Do we think we will really need that full ability? Or it might in almost all cases be sufficient to just append content?

I think we would want both options. Since the default documentation is pulling from the C++ library code, I think you could browse the current generated documentation and see sections that are not useful and could be entirely overwritten. I also think the hybrid approach of creating default documentation with options to append and/or overwrite is best because it will pull in changes to the core C++ function interface automatically while preserving the manually provided improved pythonic documentation.

(yes, to be clear my question about the need for docutils was about this (do we think appending is sufficient, or do we want the more fine grained replacement), and not to question to use of restructuredtext)

I am personally still a bit hesitant to go the full docutils way here. I certainly see the value of the flexibility it provides, but it also does introduce quite some additional code that needs to be maintained for this. And for now, all the doc additions are pure "append" ones, which could be implemented with much less code (although I know it's the goal to expand the set of doc additions, of course).

Take, for example, the parameter definitions of replace_with_mask - in the context of the still not terribly verbose description, the parameter types are incorrect and the explanation of each parameter is useless ('Argument to compute function.")

Yes, fully agreed that's basically useless. Those are dummy auto-generated on the python side, and the more general solution might be to actually start including argument descriptions in the C++ docs, so we can pull that as well into the python docstrings. There is a TODO about this (cc @pitrou):

/// \brief Symbolic names (identifiers) for the function arguments.
///
/// Some bindings may use this to generate nicer function signatures.
std::vector<std::string> arg_names;

Although of course, if only Python would make use of those extra descriptions, we could maybe as well keep them on the python side ..

(to be clear, I am not saying that we certainly don't want the more advanced docutils-based approach, but some input from others would be welcome)

@jorisvandenbossche
Copy link
Member

Would we need a check in CI that ensures the generated compute file is up-to-date? (I am not directly sure how we do that in other places with generated files)

@krcrouse
Copy link
Author

Would we need a check in CI that ensures the generated compute file is up-to-date? (I am not directly sure how we do that in other places with generated files)

Yes. One straightforward (though arguably not terribly elegant) way to check this would be to have the CI task run python scripts/generate_sources.py with a flag to redirect output and ensure that the output was exactly the same as the python/pyarrow/_compute_generated.py file.

@krcrouse krcrouse force-pushed the generate-pyarrow-compute-and-improve-docs branch from 7626d88 to 3111ea3 Compare September 2, 2022 01:53
@krcrouse
Copy link
Author

krcrouse commented Sep 4, 2022

@jorisvandenbossche ,

Here's the run down of the most recent push to the branch, which includes updates from upstream as of the end of last week:

Quick Summary Points (many that condense prior one-off comments):

  • Based on your comment, I moved python/pyarrow/_rstutils.py into python/scripts/lib/arrowdoc.py as it is not needed. I don't love a sub-lib of scripts, but it's at least straightforward to then have the generate_sources.py script reference it.
  • As I had replied earlier, there should eventually have a step added to the build or CI process that verifies the committed version of python/pyarrow/_compute_generated.py is the same thing that is output from the generate_sources.py script.
  • The archery lint --rat test is failing because the rst files in python/docs/additions/compute/ don't have license text and I do not know how do that in a sensible way and don't see it in docs or examples.

As for the points you brought up about docutils - I wonder if you are misunderstanding its role and how I use it in the PR. It's probably more straightforward with the reorganization in this round. Now, the docutils modules is only used in the scripts/lib/arrodoc.py file, and there it is only used to parse the reST files in python/docs/additions/compute/ into the document tree that is then merged/replaced/appended to the docs pulled from the C++ libraries. I did add it to the requirements-build and requirements-wheel-build files for the PR, but neither of those are really accurate. It's required solely to run the script, and so I felt that it would need to go into some sort of "requirements," but if you're not regenerating the compute functions for python, it's not required for anything.

@amol-
Copy link
Member

amol- commented Mar 30, 2023

Closing because it has been untouched for a while, in case it's still relevant feel free to reopen and move it forward 👍

@jorisvandenbossche
Copy link
Member

@krcrouse sorry for letting this slip my mind. I will try to take a look at the latest version and your last comment shortly.

@krcrouse
Copy link
Author

krcrouse commented Apr 4, 2023

@krcrouse sorry for letting this slip my mind. I will try to take a look at the latest version and your last comment shortly.

@jorisvandenbossche, sure let me know if this is something that you would like to pursue. I haven't updated it in quite a while since there hasn't been much interest but I'd be willing to do so if there's a desire to incorporate it.

@kszucs
Copy link
Member

kszucs commented Mar 31, 2025

@jorisvandenbossche shall we keep this PR open?

@pitrou
Copy link
Member

pitrou commented Mar 31, 2025

Perhaps @AlenkaF or someone else would be interested in reviving this?

@AlenkaF
Copy link
Member

AlenkaF commented Apr 1, 2025

I’m definitely interested! If someone else takes it first, feel free to go ahead, as I won’t be able to do it right away.

@kcphila
Copy link

kcphila commented Apr 1, 2025

@AlenkaF , @pitrou - I'm the original submitter. I'm more than willing to update it for the current state of the repo. I just stopped pinging back whenever because no one was interested in this work.

@AlenkaF
Copy link
Member

AlenkaF commented Apr 1, 2025

I am happy to help with reviews if you are interested to update the PR @kcphila. Thank you for being patient!

@AlenkaF
Copy link
Member

AlenkaF commented Jun 17, 2025

@kcphila I'm happy to rebase and resolve any conflicts to help move this along.
Let me know where I can be of help.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

7 participants