Clean / Consolidate pandas/tests/io/test_html.py #20293

WillAyd · 2018-03-12T04:12:28Z

This is an extremely aggressive change to get all of the test cases unified between the LXML and BS4 parsers. On the plus side both parsers can share the same set of tests and functionality, but on the downside it gives lxml a little more power than it had previously, where it would quickly fall back to bs4 for malformed sites.

Review / criticism appreciated

WillAyd · 2018-03-12T04:14:27Z

pandas/io/html.py

        return row.xpath('.//td|.//th')

    def _parse_tr(self, table):
-        expr = './/tr[normalize-space()]'


It wasn't clear to me why normalize-space() was added here. It is inconsistent with how bs4 parses tr elements and was actually causing a failure in test_computer_sales_page.

Should the need to strip trailing / leading whitespace come back up I think it would be better done in the base class than only implementing here in lxml
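For illustration, here is a minimal sketch of what the `normalize-space()` predicate changes. It uses the standard library's `xml.etree.ElementTree` rather than lxml, and because ElementTree does not support XPath functions the predicate is emulated by hand. The sample table and the helper names `all_rows` and `non_blank_rows` are assumptions made for this example, not code from the PR.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: the middle row contains only whitespace.
TABLE = """
<table>
  <tr><td>spam</td></tr>
  <tr><td>   </td></tr>
  <tr><td>eggs</td></tr>
</table>
""".strip()


def all_rows(table):
    # Counterpart of the XPath './/tr': every row, blank or not.
    return table.findall(".//tr")


def non_blank_rows(table):
    # Emulates './/tr[normalize-space()]': keep a row only if its text
    # content is non-empty once whitespace is collapsed and trimmed.
    return [
        tr for tr in table.findall(".//tr")
        if " ".join("".join(tr.itertext()).split())
    ]


table = ET.fromstring(TABLE)
print(len(all_rows(table)), len(non_blank_rows(table)))
```

The whitespace-only row is dropped by the emulated predicate and kept by the plain `.//tr` lookup, which is the kind of difference that can make two parsers disagree on row counts.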

WillAyd · 2018-03-12T04:16:13Z

pandas/io/html.py

-            r = parse(self.io, parser=parser)
-
+            if _is_url(self.io):
+                with urlopen(self.io) as f:


This conditional is required for Py27 compat - in Python3 you can simply provide a call to urlopen on self.io directly as an argument to parse (i.e. without explicitly using the context manager)
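A rough Python 3 sketch of the pattern described above: open a URL inside a `with` block so the handle is always closed, and read local paths directly. `is_url` and `read_bytes` are hypothetical stand-ins (the PR uses the private `_is_url` helper and passes the handle to lxml's `parse`), and wrapping the response in `contextlib.closing` is just one way to cope with response objects that are not context managers themselves.

```python
from contextlib import closing
from urllib.parse import urlparse
from urllib.request import urlopen


def is_url(source):
    # Hypothetical stand-in for pandas' private _is_url helper.
    schemes = {"http", "https", "ftp", "file"}
    return isinstance(source, str) and urlparse(source).scheme in schemes


def read_bytes(source):
    if is_url(source):
        # closing() only needs a .close() method, so it also works for
        # response objects that do not implement __enter__/__exit__.
        with closing(urlopen(source)) as f:
            return f.read()
    with open(source, "rb") as f:
        return f.read()
```

For example, `read_bytes(pathlib.Path("/tmp/page.html").as_uri())` and `read_bytes("/tmp/page.html")` would return the same bytes for an existing file.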

WillAyd · 2018-03-12T04:18:04Z

pandas/io/html.py

-                scheme = parse_url(self.io).scheme
-                if scheme not in _valid_schemes:
-                    # lxml can't parse it
-                    msg = (('{invalid!r} is not a valid url scheme, valid '


Rather than creating a custom error message here I re-raised. This makes the behavior consistent between lxml and bs4, allowing the test_bad_url_protocol and test_invalid_url tests to pass
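As a small illustration of why no custom message is needed: in Python 3 the standard library's `urlopen` already raises `URLError` for a scheme it has no handler for, without touching the network. The wrapper name `open_or_raise` is hypothetical and exists only to show the error propagating unchanged.

```python
from urllib.error import URLError
from urllib.request import urlopen


def open_or_raise(url):
    # No scheme pre-check and no custom message: whatever urlopen raises
    # is what the caller sees, whichever parser is in use.
    return urlopen(url)


try:
    open_or_raise("git://github.com")
except URLError as err:
    print("URLError:", err.reason)
```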

WillAyd · 2018-03-12T04:22:21Z

pandas/tests/io/data/banklist.html

 				<td class="closing">April 19, 2013</td>
 				<td class="updated">April 23, 2013</td>
 			</tr>
+			<tr>


I'm not sure if this element was intentionally removed from the test, but the two parsers did take a different path to parsing. bs4 would "insert" the leading tr tag while lxml would remove the trailing tag.

On one hand this challenges whether we really want to give lxml as much power in parsing here, since most browsers (OK, at least Safari and Chrome on my computer) matched the bs4 behavior. On the other hand, I'm not sure it's generalizable to say that the magical insertion of this or like elements by bs4 would always be desired, and perhaps it's just a risk that the user accepts when parsing malformed HTML?

WillAyd · 2018-03-12T04:23:03Z

pandas/io/html.py

        from lxml.etree import XMLSyntaxError
-
-        parser = HTMLParser(recover=False, encoding=self.encoding)
+        parser = HTMLParser(recover=True, encoding=self.encoding)


This is a change in behavior giving lxml a little more power to work through malformed HTML. May or may not be acceptable (see other comments)

WillAyd · 2018-03-12T04:31:10Z

pandas/io/html.py

-                cols = [_remove_whitespace(x.text_content()) for x in
-                        self._parse_td(tr)]
+            # Grab any directly descending table headers first
+            ths = thead[0].xpath('./th')


Because _parse_td with this parser doesn't really differentiate between td and th elements it was incorrectly parsing headers for things like spam.html where td and th elements are intermixed in the header. Hence to make the parsing more robust and pass the tests, I added an initial search for th elements before falling back to the existing behavior.

Even with that I'd argue it's confusing that _parse_td is implemented to return td and th elements and should be refactored to more clearly delineate, but I am trying to minimize behavior change with this PR
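The "th first, then fall back" lookup can be sketched roughly as follows. This is an approximation written against the standard library's `xml.etree.ElementTree`, not the lxml code in the PR; `parse_header` and `cell_text` are hypothetical names, and the two sample tables are made up for the example.

```python
import xml.etree.ElementTree as ET


def cell_text(cell):
    # Collapse all runs of whitespace inside a cell to single spaces.
    return " ".join("".join(cell.itertext()).split())


def parse_header(table):
    thead = table.find(".//thead")
    if thead is None:
        return []
    # Prefer <th> elements that are direct children of <thead>.
    ths = thead.findall("./th")
    if ths:
        return [cell_text(th) for th in ths]
    # Fallback: within each row, treat every <td> or <th> as a cell.
    # Note the result is then a list of rows rather than a flat list.
    return [
        [cell_text(c) for c in tr.iter() if c.tag in ("td", "th")]
        for tr in thead.iter("tr")
    ]


direct = ET.fromstring(
    "<table><thead><th>Name</th><th>Unit</th></thead></table>")
mixed = ET.fromstring(
    "<table><thead><tr><td>Name</td><th>Unit</th></tr></thead></table>")
print(parse_header(direct))
print(parse_header(mixed))
```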

WillAyd · 2018-03-12T04:33:23Z

pandas/tests/io/test_html.py

-        _skip_if_no('lxml')
-
+    @pytest.mark.xfail
    def test_data_fail(self):


With the changes I made this no longer fails but instead parses. If we are comfortable with that this should probably be removed, though I put as xfail for visibility atm

WillAyd · 2018-03-12T04:35:14Z

pandas/tests/io/test_html.py

+        bad = UnseekableStringIO('''
+            <table><tr><td>spam<foobr />eggs</td></tr></table>''')

+        assert self.read_html(bad)


I changed this test to remove a seek back to the start of the file that lxml was fine with. By definition of unseekable I'm not sure how that seek call was allowed...
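One possible explanation, sketched under an assumption: if the test helper is a `StringIO` subclass that only overrides `seekable()`, then `seek()` itself is untouched and keeps working. The class body below is that assumption, not code quoted from the PR.

```python
import io


class UnseekableStringIO(io.StringIO):
    # Assumption: the helper only overrides the capability flag.
    def seekable(self):
        return False


buf = UnseekableStringIO("<table><tr><td>spam</td></tr></table>")
first = buf.read()
buf.seek(0)  # still works: StringIO.seek() does not consult seekable()
second = buf.read()
print(buf.seekable(), first == second)
```

Code that checks `seekable()` before rewinding will refuse to re-read such an object, while code that calls `seek(0)` directly will succeed.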

jreback · 2018-03-12T10:32:47Z

pandas/tests/io/test_html.py


+def _missing_bs4():
+    bs4 = td.safe_import('bs4')
+    if not bs4 or LooseVersion(bs4.__version__) == LooseVersion('4.2.0'):


Note that it would be OK to require a later version of bs4 to avoid some of the older issues.

Updated the min version. FWIW in ci/requirements-2.7_COMPAT.pip the requirement is hardcoded at 4.2.0 and ci/requirements-2.7_LOCALE.pip has 4.2.1. I'm not sure the backstory to those but I assume that will cause conflict

May be rooted back in #4259?
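With the minimum raised to 4.2.1, the check could become a "below minimum" comparison instead of an equality test against 4.2.0. The following is only a sketch under that assumption: `parse_version` and `below_minimum` are hypothetical helpers that handle purely numeric dotted versions, whereas the PR itself relies on `LooseVersion`.

```python
def parse_version(version):
    # Simplified: only handles numeric dotted versions such as '4.2.1'.
    return tuple(int(part) for part in version.split("."))


def below_minimum(installed, minimum="4.2.1"):
    # True when the installed version is older than the required minimum.
    return parse_version(installed) < parse_version(minimum)


print(below_minimum("4.2.0"), below_minimum("4.2.1"))
```

Comparing tuples of integers avoids the string-comparison trap where "4.10.0" would sort before "4.2.1".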

pep8speaks · 2018-03-13T06:52:49Z

Hello @WillAyd! Thanks for updating the PR.

Cheers ! There are no PEP8 issues in this Pull Request. 🍻

Comment last updated on March 13, 2018 at 23:39 Hours UTC

codecov · 2018-03-13T06:53:12Z

Codecov Report

❗ No coverage uploaded for pull request base (master@3783ccc).
The diff coverage is 81.81%.

@@            Coverage Diff            @@
##             master   #20293   +/-   ##
=========================================
  Coverage          ?   91.74%           
=========================================
  Files             ?      150           
  Lines             ?    49154           
  Branches          ?        0           
=========================================
  Hits              ?    45097           
  Misses            ?     4057           
  Partials          ?        0

Flag        Coverage Δ
#multiple   90.13% <81.81%> (?)
#single     41.9% <13.63%> (?)

Impacted Files              Coverage Δ
pandas/compat/__init__.py   57.74% <66.66%> (ø)
pandas/io/html.py           88.79% <84.21%> (ø)

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 3783ccc...50d072d.

jreback

Can you add a note indicating the new minimum for bs4? install.rst may need updating too. Does this have equivalent tests to before? (Lots of removed code is good!)

jreback · 2018-03-13T10:07:00Z

pandas/tests/io/test_html.py

-    _skip_if_none_of(('bs4', 'html5lib'))
+def _missing_bs4():
+    bs4 = td.safe_import('bs4')
+    if not bs4 or LooseVersion(bs4.__version__) == LooseVersion('4.2.0'):


WillAyd · 2018-03-13T19:01:47Z

Missed your question before but the way this worked is that I combined all of the lxml and bs4 test classes into one, deleting tests where duplicated and also moving in some of the top level tests that weren't being tested for both (ex: test_importcheck_thread_safety)

The following were deleted:

test_data_fail as the changes to the lxml parser to make it more robust didn't trigger this failure
test_bs4_finds_tables as it wasn't something that could be shared, and also wasn't testing the parser
test_lxml_finds_tables similar reason as above

jreback · 2018-03-13T22:53:19Z

doc/source/whatsnew/v0.23.0.txt

-+-----------------+-----------------+----------+
+-----------------+-----------------+----------+---------------+
+| Package         | Minimum Version | Required |     Issue     |
+=================+=================+==========+===============+


jreback · 2018-03-13T22:54:14Z

ci/requirements-optional-conda.txt

@@ -1,4 +1,4 @@
-beautifulsoup4
+beautifulsoup4>=4.2.1


I think you need to change this:

ci/requirements-2.7_COMPAT.pip:beautifulsoup4==4.2.0

Figured as such. Without knowing the full impact, is that req file for a Travis build that tests the 4.2.0 incompatibility (ref #4259)? If so, do we even need this?

you can just change it to the new minimum version 4.2.1 (we have 1 other build which is pinned as well). always like to test 1 or 2 builds for the min.

WillAyd · 2018-03-14T05:41:34Z

CircleCI failure was a timeout - don't believe it's related to this change

jreback · 2018-03-14T10:50:03Z

thanks

note I saw a single skip on my local

pandas/tests/io/test_html.py::TestReadHtml::test_parse_failure_unseekable[lxml] SKIPPED [ 93%]

jreback · 2018-03-14T10:50:57Z

SKIP [1] /Users/jreback/pandas/pandas/tests/io/test_html.py:842: Not applicable for lxml

WillAyd · 2018-03-14T15:50:38Z

Yep there was one test that was testing the failure message when trying to read an unseekable IO object twice. lxml actually didn't have any problem reading it twice, so I placed an imperative skip for that parser in the test

WillAyd added 8 commits March 9, 2018 10:30

Converted bs4 class to pytest template

7230b69

Moved all tests to shared class

9cb215e

Added in appropriate skips; cleaned up funcs

8f0ce4d

Added reload to compat

476c19a

Merge remote-tracking branch 'upstream/master' into cln-html-tests

e8f356f

LINT fixes

478601c

Merge remote-tracking branch 'upstream/master' into cln-html-tests

d407464

Py27 compat

2360224

WillAyd commented Mar 12, 2018

jreback reviewed Mar 12, 2018

jreback added the "IO HTML" (read_html, to_html, Styler.apply, Styler.applymap) label Mar 12, 2018

WillAyd added 3 commits March 12, 2018 23:35

Merge remote-tracking branch 'upstream/master' into cln-html-tests

3d56d8b

Increased bs4 min version req

a93a5a3

Removed xfail test for lxml

29904d1

LINTing

f488fc8

jreback requested changes Mar 13, 2018

WillAyd added 3 commits March 13, 2018 10:56

Clean up unnecessary test

41b77e1

Updated documentation

d44b164

Merge remote-tracking branch 'upstream/master' into cln-html-tests

8cc21b9

jreback reviewed Mar 13, 2018

WillAyd added 3 commits March 13, 2018 16:36

Merge remote-tracking branch 'upstream/master' into cln-html-tests

6cff662

Bumped bs4 build req

e6943b1

LINT fix

50d072d

jreback added this to the 0.23.0 milestone Mar 14, 2018

jreback added the "Testing" (pandas testing functions or related to the test suite) label Mar 14, 2018

jreback approved these changes Mar 14, 2018

jreback merged commit cabc05f into pandas-dev:master Mar 14, 2018

WillAyd deleted the cln-html-tests branch March 14, 2018 15:49

WillAyd mentioned this pull request Dec 11, 2018

read_html() performance when HTML malformed #14312

Closed
