-
-
Notifications
You must be signed in to change notification settings - Fork 19.5k
Clean / Consolidate pandas/tests/io/test_html.py #20293
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
| return row.xpath('.//td|.//th') | ||
|
|
||
| def _parse_tr(self, table): | ||
| expr = './/tr[normalize-space()]' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It wasn't clear to me why normalize-space() was added here. It is inconsistent with how bs4 parses tr elements and was actually causing a failure in test_computer_sales_page.
Should the need to strip trailing / leading whitespace come back up I think it would be better done in the base class than only implementing here in lxml
| r = parse(self.io, parser=parser) | ||
|
|
||
| if _is_url(self.io): | ||
| with urlopen(self.io) as f: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This conditional is required for Py27 compat - in Python3 you can simply provide a call to urlopen on self.io directly as an argument to parse (i.e. without explicitly using the context manager)
| scheme = parse_url(self.io).scheme | ||
| if scheme not in _valid_schemes: | ||
| # lxml can't parse it | ||
| msg = (('{invalid!r} is not a valid url scheme, valid ' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Rather than creating a custom error message here I re-raised. This makes the behavior consistent between lxml and bs4, allowing the test_bad_url_protocol and test_invalid_url tests to pass
| <td class="closing">April 19, 2013</td> | ||
| <td class="updated">April 23, 2013</td> | ||
| </tr> | ||
| <tr> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure if this element was intentionally removed from the test, but the two parsers did take a different path to parsing. bs4 would "insert" the leading tr tag while lxml would remove the trailing tag.
On one hand this challenges if we really want to give lxml as much power in parsing here since most browsers (OK at least Safari and Chrome on my computer) matched the bs4 behavior, but on the other hand I'm not sure if it's generalizable to say that the magical insertion of this or like elements by bs4 would always be desired, and perhaps its just a risk that the user accepts when parsing malformed HTML?
| from lxml.etree import XMLSyntaxError | ||
|
|
||
| parser = HTMLParser(recover=False, encoding=self.encoding) | ||
| parser = HTMLParser(recover=True, encoding=self.encoding) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a change in behavior giving lxml a little more power to work through malformed HTML. May or may not be acceptable (see other comments)
| cols = [_remove_whitespace(x.text_content()) for x in | ||
| self._parse_td(tr)] | ||
| # Grab any directly descending table headers first | ||
| ths = thead[0].xpath('./th') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Because _parse_td with this parser doesn't really differentiate between td and th elements it was incorrectly parsing headers for things like spam.html where td and th elements are intermixed in the header. Hence to make the parsing more robust and pass the tests, I added an initial search for th elements before falling back to the existing behavior.
Even with that I'd argue it's confusing that _parse_td is implemented to return td and th elements and should be refactored to more clearly delineate, but I am trying to minimize behavior change with this PR
pandas/tests/io/test_html.py
Outdated
| _skip_if_no('lxml') | ||
|
|
||
| @pytest.mark.xfail | ||
| def test_data_fail(self): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
With the changes I made this no longer fails but instead parses. If we are comfortable with that this should probably be removed, though I put as xfail for visibility atm
| bad = UnseekableStringIO(''' | ||
| <table><tr><td>spam<foobr />eggs</td></tr></table>''') | ||
|
|
||
| assert self.read_html(bad) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I changed this test to remove a seek back to the start of the file that lxml was fine with. By definition of unseekable I'm not sure how that seek call was allowed...
pandas/tests/io/test_html.py
Outdated
|
|
||
| def _missing_bs4(): | ||
| bs4 = td.safe_import('bs4') | ||
| if not bs4 or LooseVersion(bs4.__version__) == LooseVersion('4.2.0'): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
note that would be ok requiring a later version of bs4 to avoid some of the older issues
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated the min version. FWIW in ci/requirements-2.7_COMPAT.pip the requirement is hardcoded at 4.2.0 and ci/requirements-2.7_LOCALE.pip has 4.2.1. I'm not sure the backstory to those but I assume that will cause conflict
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
May be rooted back in #4259?
|
Hello @WillAyd! Thanks for updating the PR. Cheers ! There are no PEP8 issues in this Pull Request. 🍻 Comment last updated on March 13, 2018 at 23:39 Hours UTC |
Codecov Report
@@ Coverage Diff @@
## master #20293 +/- ##
=========================================
Coverage ? 91.74%
=========================================
Files ? 150
Lines ? 49154
Branches ? 0
=========================================
Hits ? 45097
Misses ? 4057
Partials ? 0
Continue to review full report at Codecov.
|
jreback
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you add a note indicating the new min for bs4, may need to update install.rst, does this have equiv tests to before (lots of remove code is good!)
pandas/tests/io/test_html.py
Outdated
| _skip_if_none_of(('bs4', 'html5lib')) | ||
| def _missing_bs4(): | ||
| bs4 = td.safe_import('bs4') | ||
| if not bs4 or LooseVersion(bs4.__version__) == LooseVersion('4.2.0'): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
4.2.1
|
Missed your question before but the way this worked is that I combined all of the lxml and bs4 test classes into one, deleting tests where duplicated and also moving in some of the top level tests that weren't being tested for both (ex: The following were deleted:
|
| +-----------------+-----------------+----------+ | ||
| +-----------------+-----------------+----------+---------------+ | ||
| | Package | Minimum Version | Required | Issue | | ||
| +=================+=================+==========+===============+ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nice
| @@ -1,4 +1,4 @@ | |||
| beautifulsoup4 | |||
| beautifulsoup4>=4.2.1 | |||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think you need to change this:
ci/requirements-2.7_COMPAT.pip:beautifulsoup4==4.2.0
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Figured as such. Without knowing the full impact, is that req file for a Travis build that tests the 4.2.0 incompatibility (ref #4259). If so do we even need this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you can just change it to the new minimum version 4.2.1 (we have 1 other build which is pinned as well). always like to test 1 or 2 builds for the min.
|
CircleCI failure was a timeout - don't believe it's related to this change |
|
thanks note I saw a single skip on my local |
|
|
Yep there was one test that was testing the failure message when trying to read an unseekable IO object twice. lxml actually didn't have any problem reading it twice, so I placed an imperative skip for that parser in the test |
This is an extremely aggressive change to get all of the test cases unified between the LXML and BS4 parsers. On the plus side both parsers can share the same set of tests and functionality, but on the downside it gives lxml a little more power than it had previously, where it would quickly fall back to bs4 for malformed sites.
Review / criticism appreciated