-
-
Notifications
You must be signed in to change notification settings - Fork 19.4k
ENH: Use flake8 to check for PEP8 violations in doctests #23399
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ENH: Use flake8 to check for PEP8 violations in doctests #23399
Conversation
Signed-off-by: Fabian Haase <haase.fabian@gmail.com>
|
Hello @FHaase! Thanks for submitting the PR.
|
|
@FHaase : Thanks for the PR! Given your first comment in the description, I would recommend opening an issue first. That's generally what we would recommend instead of jumping right to the PR. |
|
Thanks @FHaase, this looks much better. Can you mention the issue you close, and the other PR related to this in the description please. I just saw couple of things. Doesn't feel like we should be using a module named legacy. I guess you checked it, but doesn't flake8 have a newer module if they call that one legacy? Also, feels like when we validate a docstring, we validate all the pep8 problems in the file, right? We should find a way to validate a single docstring. When this is done, we should also implement tests for this, as the rest of validations. In any case, good work, this will be very helpful once is done. |
Yes I mentioned it. The only way of fixing this might be to create a temp-file just with the docstring in it and run flake8 on that? |
Signed-off-by: Fabian Haase <haase.fabian@gmail.com>
setup.cfg
Outdated
| ./pandas/computation | ||
| ./pandas/core | ||
| ./pandas/errors | ||
| ./pandas/io |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
a number of these have been removed recently, can you edit
* updated folders excluded from flake8 --doctests * fixed failing doctests Signed-off-by: Fabian Haase <haase.fabian@gmail.com>
* Examples in `test_validate_docstrings.py` lacked imports * Removed signature from function in generated file as not really needed at this point. * Used `clean_doc` instead of `raw_doc` to get correct indentation. * Added missing try: finally: block in `_file_representation()` Signed-off-by: Fabian Haase <haase.fabian@gmail.com>
9d7b896 to
762de5d
Compare
Codecov Report
@@ Coverage Diff @@
## master #23399 +/- ##
=======================================
Coverage 92.23% 92.23%
=======================================
Files 161 161
Lines 51197 51197
=======================================
Hits 47220 47220
Misses 3977 3977
Continue to review full report at Codecov.
|
46cd0a4 to
844065c
Compare
Dug down into the code and basically recreated the legacy api. |
8b5c95d to
9c09bf1
Compare
* added test * simplified temporary file usage * included flake8 by calling application directly * removed doctests from setup.cfg Signed-off-by: Fabian Haase <haase.fabian@gmail.com>
9c09bf1 to
9bc7f65
Compare
datapythonista
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added some comments, mainly about code clarity and simplicity. But looks good, I think we're almost there.
|
|
||
| class BadExamples(object): | ||
|
|
||
| def doctest(self): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think doctest is not a very descriptive name if this test is to check whether pep8 should fail because of unknown np. The rest of the example is very verbose to just check that. I'd use the description to explain what is this test about. And in the examples just with >>> np.nan could be enough.
Then, couple more tests could be added, to test the typical pep8 issues we have, like >>> foo = 1+2 (missing spaces)...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The reason to use flake8 to check for pep8 issues was to not have to write all these tests. flake8 should be tested enough to work properly.
This test basically only verifies that flake8 is actually testing the docstrings.
scripts/validate_docstrings.py
Outdated
| import tempfile | ||
| from contextlib import contextmanager | ||
|
|
||
| from flake8.main.application import Application as Flake8 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it complicates things unnecessarily importing objects other than modules. import contextlib and import flake8.main.application seems like the simplest and best way. And then use accordinly @contextlib.contextmanager and app = flake8.main.application.Application. This way, just by reading those lines it's obvious what is the module and what is being used, and there is not need to go to the imports and start decoding aliases.
scripts/validate_docstrings.py
Outdated
| ] | ||
|
|
||
| @contextmanager | ||
| def _file_representation(self): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This name is not very descriptive. I think this is the examples as temporary file, so something like _examples_as_temp_file seems much clearer.
Not sure if it's worth having two separate methods, I'd generate the file in the method that validates pep8. But if we have two methods, I'd have this first, as this is being used by the other (pep8_violations).
scripts/validate_docstrings.py
Outdated
| Temporarily creates file with current function inside. | ||
| The signature and body are **not** included. | ||
| :returns file |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we use numpydoc, not this syntax
scripts/validate_docstrings.py
Outdated
| lines = self.clean_doc.split("\n") | ||
| indented_lines = [(' ' * 4) + line if line else '' | ||
| for line in lines[1:]] | ||
| doc = '\n'.join([lines[0], *indented_lines]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm a bit confused with this. I guess I'm missing something. If in our docstring we have something like:
Some desc.
Examples
--------
>>> val = 1+2
>>> print(val)
I'd expect to have in the temporary file to validate:
val = 1+2
print(val)
As we won't validate the description, or anything else. I think numpydoc returns the lines with code, so writing to the file should be a single line of code. Is there any advantage generating a function?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Well flake8 supports doctests by adding --doctests as an argument. And the Examples follow the doctest standard.
Examples
An optional section for examples, using the doctest format.
Having to carve out the examples is not trivial and would require writing multiple tests. So creating a file with a dummy-function is a simple hack, getting ugly cause it has to restore the indentation of the docstring. One test of the BadExamples class ensures it is working.
scripts/validate_docstrings.py
Outdated
| errs.append('Summary contains heading whitespaces.') | ||
| elif (doc.is_function_or_method | ||
| and doc.summary.split(' ')[0][-1] == 's'): | ||
| and doc.summary.split(' ')[0][-1] == 's'): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it'd be better to not do unrelated changes
scripts/validate_docstrings.py
Outdated
| return [ | ||
| "{} {} {}".format(s.count, s.error_code, s.message) | ||
| for s in stats.statistics_for('') | ||
| ] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd personally prefer to yield from application.guide.stats.statistics_for('') instead of returning the string. In case the information is used in more than one place, having the components can be useful.
Also, it's an opinion, but I wouldn't use a property here, as this feels more like an action than an attribute. A method validate_pep8 seems easier to understand.
Signed-off-by: Fabian Haase <haase.fabian@gmail.com>
datapythonista
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looking great. Still few comments. And we'll probably have to wait to #23161 to reuse the code extraction from the examples, but it should be merged very soon.
| Examples | ||
| -------- | ||
| >>> import pandas as pd |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I didn't realize before, but there is something tricky here. The convention is to assume numpy and pandas are always imported before the examples. I personally would be explicit, but that's how all the examples are made.
This means two things:
- pep8 will fail because of this, unless we add the imports when validating
- we should use different examples for the tests (may be not imorting
math?)
And as I said in the previous review, I'd like to have more examples of pep8 failures, like missing spaces around an operator. Also, these examples are too long for what is being tested.
scripts/validate_docstrings.py
Outdated
| import doctest | ||
| import tempfile | ||
|
|
||
| from flake8.main import application as flake8 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
please, just import flake8.main.application, and then use app = flake8.main.application.Application(). It makes things confusing to use aliases, and even more if the name of the alias is the name of the root module of what is being imported.
scripts/validate_docstrings.py
Outdated
| def validate_pep8(self): | ||
| create_function = 'def {name}():\n' \ | ||
| ' """\n{examples}\n """\n' \ | ||
| ' pass\n' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As said, I have a preference to have only the examples code in the temporary file. Extracting the code from the examples is trivial, and once #23161 is merged we'll have a property in this class that returns the lines of code.
What you will have to do is before writing the code in the examples, write import numpy as np and same for pandas, so flake8 does not complain because of that.
scripts/validate_docstrings.py
Outdated
|
|
||
| application = flake8.Application() | ||
| application.initialize(["--doctests", "--quiet"]) | ||
| application.run_checks([file.name]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you just leave the writing to the file and this line inside the context manager, and creating the app object, initializing... have it outside?
scripts/validate_docstrings.py
Outdated
| for param_err in param_errs: | ||
| errs.append('\t{}'.format(param_err)) | ||
|
|
||
| pep8_errs = [error for error in doc.validate_pep8()] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this is easier to read:
| pep8_errs = [error for error in doc.validate_pep8()] | |
| pep8_errs = list(doc.validate_pep8()) |
Co-Authored-By: FHaase <haase.fabian@gmail.com>
Signed-off-by: Fabian Haase <haase.fabian@gmail.com>
datapythonista
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks quite good, added a couple of comments about what to expect from #23161, and some style preferences, but we're almost there.
I'm really looking forward to have this merged. After that I'll add this validation to the CI, so we don't need to keep reviewing the pep8 of examples manually, which is a significant amount of work.
Thanks for all the work on this.
| """ | ||
| pass | ||
|
|
||
| def pandas_imported(self): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry for the misunderstanding. numpy_imported and pandas_imported and the corresponding tests are not needed, as that is already tested in #23161.
As mentioned in the previous review, forget about these two imports in the examples in this PR. You are repeating what's implemented in #23161 which will be merged earlier.
| ' are imported as pd or np',)), | ||
| ('BadExamples', 'pandas_imported', | ||
| ('F811 It\'s assumed that pandas and numpy' | ||
| ' are imported as pd or np',)), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
these two examples need to be removed
scripts/validate_docstrings.py
Outdated
| """ | ||
| content = ''.join(('import numpy as np; ' | ||
| 'import pandas as pd # noqa: F401,E702\n', | ||
| *self.examples_source_code)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
wouldn't it be simpler to join by \n instead, and remove the semicolon and the noqa?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- regarding E702: with one line the great thing was to achieve intercepting of the error message.
- regarding F401: All examples not using both: np and pd would fail because of an unused import.
The implementation of self.examples_source_code of PR #23161 returns an array with \n at the end, so joining with \n isn't possible.
scripts/validate_docstrings.py
Outdated
| if statistic.message.endswith('from line 1'): | ||
| statistic.message = "It's assumed that pandas and numpy" \ | ||
| " are imported as pd or np" | ||
| yield statistic |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I guess with the original yield from we may get a pep8 error of duplicate import, or something similar. I'm fine with that. Besides the pep8 errors, we'll have the error of the other section implemented in #23161, which should make it very easy for users to identify the problem.
So, I'd simply leave the yield from ...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I found F811 redefinition of unused 'np' from line 1 could be confusing when read first.
scripts/validate_docstrings.py
Outdated
| application = flake8.main.application.Application() | ||
| application.initialize(["--quiet"]) | ||
| application.run_checks([file.name]) | ||
| application.report() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry I wasn't clear in my previous review about this. This is what I'd do:
application = flake8.main.application.Application()
application.initialize(["--quiet"])
with tempfile.NamedTemporaryFile(mode='w') as f:
f.write(content)
f.flush()
application.run_checks([f.name])
application.report()
I think it's a good practice to have inside the context manager, the code that requires to be there.
Also, as this file is going quite significantly, I think it's better to keep all this functionality in a single function, so each validation is somehow isolated. I think in another context it makes more sense what you did of splitting, but I don't think we'll reuse the temporary file creation, and I think the improvement in simplicity makes it worth.
scripts/validate_docstrings.py
Outdated
| yield file | ||
|
|
||
| def validate_pep8(self): | ||
| if self.examples: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
small thing, but my preference is having:
if not self.examples:
return
this avoids having the rest of the function indented, and in my opinion increases readability.
Signed-off-by: Fabian Haase <haase.fabian@gmail.com>
…docstrings Signed-off-by: Fabian Haase <haase.fabian@gmail.com>
Signed-off-by: Fabian Haase <haase.fabian@gmail.com>
datapythonista
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| if not self.examples: | ||
| return | ||
|
|
||
| content = ''.join(('import numpy as np # noqa: F401\n', |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
may be we can add a comment before this line explaining why the F401 is required.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Well the content basically never shows as it's only existing within tmp files. And if someone thinks it's unnecessary the tests would fail.
So I don't think commenting is needed there.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The comment was not for the temp file, but our code, and I think it's useful to know what the F401 is about when reading this part of code (without having to remove it and see that fails). Anyway, this is merged now. :)
jreback
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
merge when satisfied
|
Thanks @FHaase, extremely useful PR |
git diff upstream/master -u -- "*.py" | flake8 --diffImprovement to PR #23381 regarding issue #23154
Enables
--doctestsflag, resulting in doctests being checked by flake8Uses flake8.api.legacy to invoke the tests.
Downsides are:
flake8.get_style_guidedon't have an effect. Thussetup.cfghad to be enhanced.python ./scripts/validate_docstrings.py pandas.read_excelHelp in the form of
Use "flake8 pandas/util/_decorators.py" for further information.could be added to the output: