Add charset_normalizer detection.
#1791
    Return the encoding, as determined by `charset_normalizer`.
    """
    content = getattr(self, "_content", b"")
    if len(content) < 32:
There are cases where detection works just fine with small content. I would suggest silencing the warning instead.
(target: x.encode("utf_8"))
* Using the following `Qu'est ce que une étoile?`
chardet detect ISO-8859-1
cchardet detect IBM852
charset-normalizer detect utf-8
* Using the following `Qu’est ce que une étoile?`
chardet detect utf-8
cchardet detect UTF-8
charset-normalizer detect utf-8
* Using the following `<?xml ?><c>Financiën</c>`
chardet detect ISO-8859-1
cchardet detect ISO-8859-13
charset-normalizer detect utf-8
* Using the following `(° ͜ʖ °), creepy face, smiley 😀`
chardet detect Windows-1254
cchardet detect UTF-8
charset-normalizer detect utf-8
* Using the following `["Financiën", "La France"]`
chardet detect utf-8
cchardet detect ISO-8859-13
charset-normalizer detect utf-8
WDYT?
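To illustrate why short payloads are ambiguous for any detector, here is a minimal sketch that uses only the Python standard library and no detector at all. The helper name `decodings_without_error` is hypothetical and not part of this PR. It shows that the same UTF-8 bytes from the first sample also decode without error under a legacy single-byte encoding, so a detector has to fall back on heuristics to choose between them.

```python
def decodings_without_error(data: bytes, encodings: list[str]) -> dict[str, str]:
    """Return {encoding: decoded text} for each encoding that decodes `data` strictly."""
    decoded = {}
    for name in encodings:
        try:
            decoded[name] = data.decode(name)  # strict mode: raises on invalid bytes
        except UnicodeDecodeError:
            continue  # this encoding cannot decode the byte sequence
    return decoded


# Same target as in the comparison above: the text encoded as UTF-8.
sample = "Qu'est ce que une étoile?".encode("utf_8")
candidates = decodings_without_error(sample, ["ascii", "utf_8", "latin_1"])
for name, text in candidates.items():
    print(f"{name}: {text}")
```

Both `utf_8` and `latin_1` succeed here, but only the first yields the intended text; the second produces mojibake that is still a valid string.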
I went with that originally, and a couple of the tests with small amounts of content returned results I wasn't expecting. If `apparent_encoding` is `None`, then we'll end up decoding it with `'utf-8'`, `errors='replace'`, which I figure is a pretty reasonable default for the corner case.
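A minimal sketch of the fallback described in this comment, assuming the 32-byte threshold from the diff. `decode_body`, `MIN_DETECTION_LENGTH`, and the `detect` parameter are hypothetical stand-ins for illustration, not the code from this PR; `detect` represents whichever detector is plugged in.

```python
from typing import Callable, Optional

MIN_DETECTION_LENGTH = 32  # threshold taken from the diff under review


def decode_body(content: bytes, detect: Callable[[bytes], Optional[str]]) -> str:
    """Decode `content`, skipping detection for very short bodies."""
    # Too little data to trust a guess: treat the encoding as unknown.
    if len(content) >= MIN_DETECTION_LENGTH:
        apparent_encoding = detect(content)
    else:
        apparent_encoding = None

    if apparent_encoding is None:
        # Corner case: fall back to UTF-8 and replace undecodable bytes.
        return content.decode("utf-8", errors="replace")
    return content.decode(apparent_encoding, errors="replace")


# A deliberately wrong detector shows that short content bypasses it.
always_latin1 = lambda data: "latin-1"
print(decode_body("é".encode("utf-8"), always_latin1))
```

With this shape, a wrong guess on a tiny body cannot produce mojibake; the cost is that short non-UTF-8 bodies get replacement characters.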
Alrighty, that is reasonable. 👍
On a separate note, you could run the detection anyway and check whether the result has a SIG/BOM; that could be reasonable too. The result would then be discarded if `len(content) < 32` and `best_guess.bom` is `False`.
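    from charset_normalizer import from_bytes

    results = from_bytes(content)
    best_guess = results.best()  # may be None if nothing matched
    if best_guess is not None and best_guess.bom:
        ...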
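The rule suggested here (keep a guess for short content only when a BOM is present) can be sketched without any detector. This is an illustration under stated assumptions: it swaps in the standard library's `codecs` BOM constants in place of `best_guess.bom`, and `bom_encoding` and `accept_guess` are hypothetical names, not part of `charset_normalizer` or this PR.

```python
import codecs
from typing import Optional

# Longer BOMs come first so UTF-32 LE is not mistaken for UTF-16 LE.
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32"),
    (codecs.BOM_UTF32_BE, "utf-32"),
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
]


def bom_encoding(content: bytes) -> Optional[str]:
    """Return a codec name if `content` starts with a known BOM, else None."""
    for bom, encoding in _BOMS:
        if content.startswith(bom):
            return encoding
    return None


def accept_guess(content: bytes, guess: Optional[str], min_length: int = 32) -> Optional[str]:
    """Keep `guess` unless the content is short and carries no BOM."""
    if guess is None:
        return None
    if len(content) < min_length and bom_encoding(content) is None:
        return None  # short and unsigned: treat the encoding as unknown
    return guess
```

For example, `accept_guess(b"hi", "latin-1")` discards the guess, while the same call on content prefixed with `codecs.BOM_UTF8` keeps it.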
Use `charset_normalizer` to auto-detect character encodings in cases where no `Content-Type: text/...; charset=...` is included.

See #1657 and https://github.com/tomchristie/top-1000 for some evidence-led rationale behind this change.