@@ -2043,8 +2043,8 @@ Reading HTML Content
2043 2043
2044 2044 .. warning::
2045 2045
2046-    We **highly encourage** you to read the :ref:`HTML Table Parsing gotchas<gotchas.html>`
2047-    regarding the issues surrounding the BeautifulSoup4/html5lib/lxml parsers.
2046+    We **highly encourage** you to read the :ref:`HTML Table Parsing gotchas<io.html.gotchas>`
2047+    below regarding the issues surrounding the BeautifulSoup4/html5lib/lxml parsers.
2048 2048
2049 2049 .. versionadded:: 0.12.0
2050 2050
@@ -2346,6 +2346,83 @@ Not escaped:
2346 2346 Some browsers may not show a difference in the rendering of the previous two
2347 2347 HTML tables.
2348 2348
2349+
2350+ .. _io.html.gotchas:
2351+
2352+ HTML Table Parsing Gotchas
2353+ ''''''''''''''''''''''''''
2354+
2355+ There are some versioning issues surrounding the libraries that are used to
2356+ parse HTML tables in the top-level pandas io function ``read_html``.
2357+
2358+ **Issues with** |lxml|_
2359+
2360+    * Benefits
2361+
2362+      * |lxml|_ is very fast.
2363+
2364+    * Drawbacks
2365+
2366+      * |lxml|_ requires Cython to install correctly.
2367+
2368+      * |lxml|_ does *not* make any guarantees about the results of its parse
2369+        *unless* it is given |svm|_.
2370+
2371+      * In light of the above, we have chosen to allow you, the user, to use the
2372+        |lxml|_ backend, but **this backend will use** |html5lib|_ if |lxml|_
2373+        fails to parse.
2374+
2375+      * It is therefore *highly recommended* that you install both
2376+        |BeautifulSoup4|_ and |html5lib|_, so that you will still get a valid
2377+        result (provided everything else is valid) even if |lxml|_ fails.
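
The fallback described above can also be spelled out by hand. The following is
a minimal sketch, not the actual pandas implementation:
``read_tables_with_fallback`` and its ``reader`` parameter are hypothetical
names introduced for illustration, and the sketch assumes that parser failures
surface as ``ValueError`` or ``ImportError``. It relies on the ``flavor``
argument of ``read_html``, which accepts ``"lxml"`` and ``"bs4"``
(BeautifulSoup4 with html5lib).

```python
def read_tables_with_fallback(source, flavors=("lxml", "bs4"), reader=None, **kwargs):
    """Try each parser flavor in order and return the first successful parse.

    ``reader`` defaults to ``pandas.read_html``. It is injectable only so that
    this sketch can be exercised without pandas installed; that parameter is
    an assumption of the example, not part of the pandas API.
    """
    if reader is None:
        # Imported lazily so the helper has no hard dependency on pandas.
        import pandas as pd

        reader = pd.read_html

    errors = []
    for flavor in flavors:
        try:
            return reader(source, flavor=flavor, **kwargs)
        except (ValueError, ImportError) as exc:
            # Assumed failure modes: markup the parser rejects, or a
            # parser library that is not installed.
            errors.append(f"{flavor}: {exc}")
    raise ValueError("no parser could read the tables (" + "; ".join(errors) + ")")
```

Calling ``read_tables_with_fallback(source)`` with a URL or file-like object
tries |lxml|_ first and only then the slower, more lenient parser, so the
order of ``flavors`` expresses the speed/leniency trade-off explicitly.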
2378+
2379+ **Issues with** |BeautifulSoup4|_ **using** |lxml|_ **as a backend**
2380+
2381+    * The above issues hold here as well since |BeautifulSoup4|_ is essentially
2382+      just a wrapper around a parser backend.
2383+
2384+ **Issues with** |BeautifulSoup4|_ **using** |html5lib|_ **as a backend**
2385+
2386+    * Benefits
2387+
2388+      * |html5lib|_ is far more lenient than |lxml|_ and consequently deals
2389+        with *real-life markup* in a much saner way rather than just, e.g.,
2390+        dropping an element without notifying you.
2391+
2392+      * |html5lib|_ *generates valid HTML5 markup from invalid markup
2393+        automatically*. This is extremely important for parsing HTML tables,
2394+        since it guarantees a valid document. However, that does NOT mean that
2395+        it is "correct", since the process of fixing markup does not have a
2396+        single definition.
2397+
2398+      * |html5lib|_ is pure Python and requires no additional build steps beyond
2399+        its own installation.
2400+
2401+    * Drawbacks
2402+
2403+      * The biggest drawback to using |html5lib|_ is that it is slow as
2404+        molasses. However, consider the fact that many tables on the web are not
2405+        big enough for the parsing algorithm runtime to matter. It is more
2406+        likely that the bottleneck will be in the process of reading the raw
2407+        text from the URL over the web, i.e., IO (input-output). For very large
2408+        tables, this might not be true.
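
One way to check whether parsing or IO dominates for a given page is to time
the two steps separately. The sketch below uses only the standard library;
``timed`` is a hypothetical helper, and the commented usage assumes a
placeholder URL and an installed pandas, neither of which comes from this
document.

```python
import time


def timed(func, *args, **kwargs):
    """Call ``func`` and return ``(result, elapsed_seconds)``."""
    start = time.perf_counter()
    result = func(*args, **kwargs)
    return result, time.perf_counter() - start


# Hypothetical usage (the URL is a placeholder):
#
#   import io
#   import urllib.request
#   import pandas as pd
#
#   def fetch(url):
#       with urllib.request.urlopen(url) as resp:
#           return resp.read()
#
#   raw, io_secs = timed(fetch, "http://example.com/tables.html")
#   tables, parse_secs = timed(pd.read_html, io.BytesIO(raw), flavor="bs4")
#   print(f"IO: {io_secs:.3f}s, parse: {parse_secs:.3f}s")
```

If the parse time is small next to the download time, switching parsers is
unlikely to make a noticeable difference for that page.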
2409+
2410+
2411+ .. |svm| replace:: **strictly valid markup**
2412+ .. _svm: http://validator.w3.org/docs/help.html#validation_basics
2413+
2414+ .. |html5lib| replace:: **html5lib**
2415+ .. _html5lib: https://github.com/html5lib/html5lib-python
2416+
2417+ .. |BeautifulSoup4| replace:: **BeautifulSoup4**
2418+ .. _BeautifulSoup4: http://www.crummy.com/software/BeautifulSoup
2419+
2420+ .. |lxml| replace:: **lxml**
2421+ .. _lxml: http://lxml.de
2422+
2423+
2424+
2425+
2349 2426 .. _io.excel:
2350 2427
2351 2428 Excel files