This is a Python 3 library that mainly does what the WHATWG HTML5 spec calls 'prescan':

1. Check for a UTF-8 or UTF-16 BOM.
   If one is found, strip the BOM and move on to step 3.
   (Strictly, this is not 'prescan' proper, but a process before it.)
2. Prescan (parse the <meta> tag to get the Encoding Name).
3. Resolve the retrieved Name to a Python codec name.

Note that it just returns the Python codec name string, not a codec object.
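The BOM check in step 1 might be sketched like this (an illustrative snippet, not the library's actual code; strip_bom is a hypothetical helper name):

```python
def strip_bom(buf):
    """Return (encoding name or None, buf without its BOM).

    If a UTF-8 or UTF-16 BOM is present, the encoding is decided
    here and the <meta> prescan (step 2) is skipped entirely.
    """
    if buf.startswith(b"\xef\xbb\xbf"):
        return "UTF-8", buf[3:]
    if buf.startswith(b"\xfe\xff"):
        return "UTF-16BE", buf[2:]
    if buf.startswith(b"\xff\xfe"):
        return "UTF-16LE", buf[2:]
    return None, buf
```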
$ pip install html5prescan

html5prescan.get(buf, length=1024, jsonfile=None)

Parse the input byte string buf, and return (Scan, buf).
Scan is a namedtuple with fields:
- label: Encoding Label
- name: Encoding Name
- pyname: Python codec name
- start: start position of the match
- end: end position of the match
- match: matched substring
The match is from '<meta' to the byte position
where successful parsing returned.
Encoding Label and Encoding Name are defined
in the WHATWG Encoding standard.
The site provides an encodings.json file for programmatic usage,
and by default the library uses a bundled copy of it (when the jsonfile argument is None).
See the docstring of html5prescan.get for the details
(e.g. $ pydoc 'html5prescan.get').
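For illustration, the label-to-name-to-codec resolution works roughly like this (a toy sketch: the LABELS dict below hard-codes a few entries from WHATWG Encoding, whereas the library reads the full encodings.json; the resolve function name is hypothetical):

```python
import codecs

# A tiny excerpt of the WHATWG label -> Encoding Name mapping;
# the real library loads the complete encodings.json instead.
LABELS = {
    "greek": "ISO-8859-7",
    "iso-8859-7": "ISO-8859-7",
    "utf-8": "UTF-8",
    "utf8": "UTF-8",
}

def resolve(label):
    """Map an Encoding Label to (Encoding Name, Python codec name).

    Returns (None, None) for an unknown label, and a None pyname
    when Python has no codec for the name.
    """
    name = LABELS.get(label.strip().lower())
    if name is None:
        return None, None
    try:
        pyname = codecs.lookup(name).name  # canonical Python codec name
    except LookupError:
        pyname = None
    return name, pyname
```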
---
As a command line script, if there are no arguments,
it reads standard input and prints the resulting Scan.
$ html5prescan
<meta charset=greek>
(CTRL+D)
Scan(label='greek', name='ISO-8859-7', pyname='ISO-8859-7',
start=0, end=20, match='<meta charset=greek>')

In any other case, it just prints a help message.
To test, run make test.
The test data files are derived from html5lib/encoding/tests*.dat files.
The original tests are for the main HTML parser, not for the prescan parser,
so I edited and renamed them (prescan1.dat and prescan2.dat).
See the first six commits for the diffs.
I also added some more tests ad hoc (prescan3.dat).
Then, I tested the test data against well-known libraries
(validator, jsdom, html5lib).
I reported all inconsistencies upstream,
and the validator and jsdom maintainers confirmed my interpretations.
So I believe my library and tests are in a good state.
For the details, see test/resource/memo/201910-comparison.rst.
The library imitates the WHATWG prescan algorithm in Python code (countless small byte slicings and copyings), so it is naturally slow. But it is better to know how slow.
scrapy/w3lib uses a well-maintained, and therefore relatively complex, regex search
to get the encoding declaration.
(Python's regex engine is largely implemented in C, which makes it fast.)
In my informal tests, the library is about 20 times slower than w3lib.
I think this is within the range of expectation: not good, but not bad either.
For the details, see test/resource/memo/201910-performance.rst.
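The speed gap is easy to reproduce in miniature: a pure-Python byte-walking scanner versus a single compiled regex over the same buffer. This is a toy comparison written for this README, not the actual code of the library or of w3lib:

```python
import re
import timeit

buf = b"x" * 500 + b'<meta charset="utf-8">' + b"y" * 500

META_RE = re.compile(rb'''<meta\s+charset=["']?([a-zA-Z0-9_-]+)''')

def find_with_regex(data):
    # One compiled regex pass, mostly executed in C.
    m = META_RE.search(data)
    return m.group(1) if m else None

def find_with_slicing(data):
    # Naive spec-style byte walking: lots of small slices and copies.
    i = 0
    while i < len(data):
        if data[i:i + 5].lower() == b"<meta":
            j = data.find(b"charset=", i)
            if j == -1:
                return None
            j += len(b"charset=")
            if data[j:j + 1] in (b'"', b"'"):
                j += 1
            k = j
            while k < len(data) and data[k:k + 1] not in (b'"', b"'", b">", b" "):
                k += 1
            return data[j:k]
        i += 1
    return None

regex_t = timeit.timeit(lambda: find_with_regex(buf), number=500)
slice_t = timeit.timeit(lambda: find_with_slicing(buf), number=500)
print(f"regex: {regex_t:.4f}s  slicing: {slice_t:.4f}s")
```

On a typical machine the slicing version loses by a wide margin, which is the same effect the memo measures at full scale.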
Around 2013, WHATWG introduced a new encoding called 'replacement'.
It masks some insecure, non-ASCII-compatible encodings,
and it decodes input bytes of any length to a single U+FFFD character.
Python doesn't have a codec corresponding to this encoding,
and this library returns None for pyname.
Users may need to add an extra check for this encoding.
The library includes an implementation of this codec (replacement.py).
So in very rare cases, users may want to look at it.
Users who want to register this codec can call replacement.register().
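The extra check mentioned above might look like this (a sketch; decode_with_pyname is a hypothetical helper, and the None case mirrors what the library returns for the 'replacement' encoding):

```python
REPLACEMENT_CHAR = "\ufffd"

def decode_with_pyname(buf, pyname):
    """Decode buf with the resolved Python codec name.

    pyname is None when the declared encoding is WHATWG's
    'replacement' encoding (which has no stock Python codec):
    'replacement' decodes any nonempty input, whatever its
    length, to a single U+FFFD.
    """
    if pyname is None:
        return REPLACEMENT_CHAR if buf else ""
    return buf.decode(pyname, errors="replace")
```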
https://github.com/zackw/html5-chardet
is a C version of validator's MetaScanner.java.
Its author also uses html5lib tests edited for prescan,
so I am obviously following his path.
Relevant WHATWG html specs for prescan are:
- https://html.spec.whatwg.org/multipage/parsing.html#prescan-a-byte-stream-to-determine-its-encoding
- https://html.spec.whatwg.org/multipage/parsing.html#concept-get-attributes-when-sniffing
- https://html.spec.whatwg.org/multipage/urls-and-fetching.html#extracting-character-encodings-from-meta-elements
It is just a part of the initial encoding determination process.
---
validator, jsdom, html5lib, w3lib:
- https://github.com/validator/htmlparser
- https://github.com/jsdom/html-encoding-sniffer
- https://github.com/html5lib/html5lib-python
- https://github.com/scrapy/w3lib
The software is licensed under The MIT License. See LICENSE.