ARROW-12526: Pre-generating pyarrow.compute and creating a docstring additions system for pyarrow functions #13126
To save you all from having to check out the branch and generate the code, attached is what the |
|
@krcrouse Just a word to tell you that I haven't had a chance to take a look yet, but it's definitely on my TODO list (or may be switched to someone else's :-)). I hope the wait isn't too demotivating. |
|
@krcrouse thanks a lot for your work on this! I haven't looked in detail yet, but here are some general comments / questions on the overall approach:
|
|
@jorisvandenbossche, I'll wait for some more guidance from you all and just respond to your comments inline for the moment.
That makes sense and probably helps in the overall structure.
docutils is used to process the reStructuredText appendices so that they can be merged with the autogenerated docs. In this model, I propose using reST for the function documentation additions because it's established, it's testable, and contributors will at least understand what it is even if they're not proficient in it. If your question is more about "the need for it as a required module" - If we include the generated files then the
Agreed - it should be a private module.
I think we would want both options. Since the default documentation is pulling from the C++ library code, I think you could browse the current generated documentation and see sections that are not useful and could be entirely overwritten. I also think the hybrid approach of creating default documentation with options to append and/or overwrite is best because it will pull in changes to the core C++ function interface automatically while preserving the manually provided improved pythonic documentation. Take, for example, the parameter definitions of
Agreed. |
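The hybrid approach described above (default documentation with options to append and/or overwrite) can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: the section-dictionary data model, the `merge_sections` helper, and the sample docstring text are all assumptions made for the example.

```python
from typing import Dict, Tuple

# Hypothetical data model: a docstring is an ordered mapping of
# section name -> text, and an addition is a (mode, text) pair.
Sections = Dict[str, str]
Additions = Dict[str, Tuple[str, str]]


def merge_sections(generated: Sections, additions: Additions) -> Sections:
    """Apply hand-written additions on top of auto-generated sections."""
    merged = dict(generated)
    for name, (mode, text) in additions.items():
        if mode not in ("append", "overwrite"):
            raise ValueError(f"unknown mode: {mode!r}")
        if mode == "append" and name in merged:
            # Keep the generated text and add the manual text after it.
            merged[name] = merged[name].rstrip() + "\n\n" + text
        else:
            # Overwrite an existing section, or create a missing one.
            merged[name] = text
    return merged


# Illustrative input only; not copied from the real generated docs.
generated = {
    "summary": "Count the values in the input.",
    "Parameters": "array : Array-like\n    Argument to compute function.",
}
additions = {
    "Parameters": ("overwrite", "array : Array-like\n    Values to count."),
    "Examples": ("append", ">>> import pyarrow.compute as pc"),
}
merged = merge_sections(generated, additions)
```

Sections that have no addition pass through untouched, so changes to the C++ interface would still flow into the docstring automatically.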
084a1ad to
57f2ea2
Compare
|
Hi, @jorisvandenbossche and @pitrou, I've pushed new updates to this branch based on the comments, including the fully generated source files in I've resolved updates to the tests and included the new compute functions that have been added since the original creation of the branch. In line with the movement towards docutils, the following will test all of the examples that get pulled into the pyarrow.compute function documentation: |
7c8a3bd to
e90c6cf
Compare
ba2bacd to
8c52f62
Compare
|
Hi @jorisvandenbossche and @pitrou, I made a number of additions so that the autogenerated _compute.py code conforms with flake8, since it doesn't distinguish between handwritten and autogenerated code. I also updated the generated function signatures to be compliant with Python 3.7. Please let me know how you would like to move forward on this, if you do. If I'm reading the workflow output correctly, I think the only issue right now is that
INFO:archery:Running Docker linter
apache-rat license violation: python/docs/additions/compute/all.rst
apache-rat license violation: python/docs/additions/compute/any.rst
apache-rat license violation: python/docs/additions/compute/count.rst
apache-rat license violation: python/docs/additions/compute/count_distinct.rst
apache-rat license violation: python/docs/additions/compute/filter.rst
apache-rat license violation: python/docs/additions/compute/index.rst
apache-rat license violation: python/docs/additions/compute/indices_nonzero.rst
apache-rat license violation: python/docs/additions/compute/mode.rst
apache-rat license violation: python/docs/additions/test_example.rst |
python/pyarrow/_rstutils.py
This module is only needed for the automatic generation, right? So in case we check in the generated file, this is only needed for developers, and can maybe be moved outside of pyarrow? (we don't need to ship this in the packages) For example also in python/scripts ?
|
@krcrouse thanks for the updates! And sorry for the slow follow-up on your responses.
(Yes, to be clear, my question about the need for docutils was about this: do we think appending is sufficient, or do we want the more fine-grained replacement? It was not meant to question the use of reStructuredText.) I am personally still a bit hesitant to go the full docutils way here. I certainly see the value of the flexibility it provides, but it also introduces quite some additional code that needs to be maintained. And for now, all the doc additions are pure "append" ones, which could be implemented with much less code (although I know the goal is to expand the set of doc additions, of course).
Yes, fully agreed that's basically useless. Those are dummy descriptions auto-generated on the Python side, and the more general solution might be to actually start including argument descriptions in the C++ docs, so we can pull those into the Python docstrings as well. There is a TODO about this (cc @pitrou): arrow/cpp/src/arrow/compute/function.h Lines 132 to 135 in b832853
Although of course, if only Python would make use of those extra descriptions, we could maybe as well keep them on the python side .. (to be clear, I am not saying that we certainly don't want the more advanced docutils-based approach, but some input from others would be welcome) |
|
Would we need a check in CI that ensures the generated compute file is up-to-date? (I am not directly sure how we do that in other places with generated files) |
Yes. One straightforward (though arguably not terribly elegant) way to check this would be to have the CI task run |
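Independent of the exact command meant in the comment above, one possible shape for such a freshness check is to regenerate the source in memory and diff it against the checked-in file. The sketch below is only an assumption about how that could look: `generate_source` is a hypothetical stand-in for the real generator, and the file name is illustrative.

```python
import difflib
import tempfile
from pathlib import Path
from typing import List


def generate_source() -> str:
    """Hypothetical stand-in for the real code generator."""
    return "def example():\n    return 1\n"


def diff_against_generated(path: Path, expected: str) -> List[str]:
    """Return a unified diff; an empty list means the file is up to date."""
    current = path.read_text() if path.exists() else ""
    return list(difflib.unified_diff(
        current.splitlines(keepends=True),
        expected.splitlines(keepends=True),
        fromfile=str(path),
        tofile="regenerated",
    ))


with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "compute.py"

    # Checked-in file matches the generator output: no diff.
    target.write_text(generate_source())
    fresh_diff = diff_against_generated(target, generate_source())

    # Checked-in file was edited or is stale: non-empty diff.
    target.write_text("def example():\n    return 0\n")
    stale_diff = diff_against_generated(target, generate_source())
```

A CI job could print the diff and exit with a non-zero status whenever it is non-empty.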
… for the '/' in the function signature
Co-authored-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>
7626d88 to
3111ea3
Compare
|
Here's the rundown of the most recent push to the branch, which includes updates from upstream as of the end of last week. Quick summary points (many of which condense prior one-off comments):
As for the points you brought up about |
|
Closing because it has been untouched for a while. If it's still relevant, feel free to reopen and move it forward 👍 |
|
@krcrouse sorry for letting this slip my mind. I will try to take a look at the latest version and your last comment shortly. |
@jorisvandenbossche, sure let me know if this is something that you would like to pursue. I haven't updated it in quite a while since there hasn't been much interest but I'd be willing to do so if there's a desire to incorporate it. |
|
@jorisvandenbossche shall we keep this PR open? |
|
Perhaps @AlenkaF or someone else would be interested in reviving this? |
|
I’m definitely interested! If someone else takes it first, feel free to go ahead, as I won’t be able to do it right away. |
|
I am happy to help with reviews if you are interested in updating the PR @kcphila. Thank you for being patient! |
|
@kcphila I'm happy to rebase and resolve any conflicts to help move this along. |
This PR addresses both the cited JIRA issue (pre-generate pyarrow.compute) and a dev thread suggesting that the ability to add Python docs for functions inherited from the Arrow C++ library would greatly improve readability for Python users.
There are still a few things to work out, such as where in the build process to generate the code and whether a version of the generated code should be checked into version control or not, but @pitrou suggested opening the PR to field comments from developers.
Major points:
- A python/docs/additions tree holds the reStructuredText docs that include the sections to overwrite. Raw reST is used so that code block examples can be tested using doctest; see the README for more verbose details.
- pyarrow.docutils (or maybe it should be _docutils) provides functions to process python/docs/additions and return a data structure of the components per function.
- python/scripts/generate_sources.py uses pyarrow.docutils and writes out the code for the compute functions in pyarrow/generated/compute.py. All of the logic from the release-branch pyarrow.compute module that dynamically generated the compute functions has been moved to this script.
- pyarrow.compute now imports from pyarrow.generated.compute for all of the autogenerated compute bindings. Override and custom functions are still defined here.
- pyarrow._compute_docstrings is gone because its purpose is subsumed in the above.
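
To illustrate the pre-generation step in general terms, here is a rough sketch of turning function metadata into Python source text. It is not the code in python/scripts/generate_sources.py: the `FUNCTIONS` list, the template, and the `call_function` stand-in are assumptions made for the example, and the real script would take its metadata from the Arrow C++ function registry.

```python
# Hypothetical metadata for two compute functions.
FUNCTIONS = [
    {"name": "add", "args": ["x", "y"],
     "doc": "Add the arguments element-wise."},
    {"name": "negate", "args": ["x"],
     "doc": "Negate the argument element-wise."},
]

# Source template for one generated wrapper function.
TEMPLATE = '''\
def {name}({params}):
    """{doc}"""
    return call_function({name!r}, [{params}])
'''


def render_module(functions):
    """Render the full source text of a generated module."""
    header = "# Generated file, do not edit by hand.\n\n"
    bodies = [
        TEMPLATE.format(name=f["name"],
                        params=", ".join(f["args"]),
                        doc=f["doc"])
        for f in functions
    ]
    return header + "\n\n".join(bodies)


source = render_module(FUNCTIONS)

# Execute the generated text with a stand-in dispatcher to confirm that
# it is valid Python and that the wrappers forward their arguments.
namespace = {"call_function": lambda name, args: (name, args)}
exec(source, namespace)
result = namespace["add"](1, 2)
```

Because the output is plain source text, it can be checked into version control, linted with flake8, and inspected by IDEs, which is harder with functions created dynamically at import time.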