Summary
For messages parsed via parse_from_bytes (or MailParser.from_bytes), text body parts whose Content-Transfer-Encoding is 8bit, 7bit, or absent are mis-decoded: UTF-8 byte sequences are passed through raw-unicode-escape, which interprets them as Latin-1. The result is silent mojibake: no exception is raised, and the declared charset=utf-8 from Content-Type is ignored.
Example: — (em-dash, U+2014, UTF-8 \xe2\x80\x94) becomes the three-codepoint sequence â\x80\x94.
This is related to but not fixed by #97 / #125. PR #125 added a try/except UnicodeDecodeError fallback to ported_string(payload, encoding=charset), but the except branch only fires on malformed \u escapes (the original #97 crash). Valid UTF-8 byte sequences decode "successfully" under raw-unicode-escape and never trigger the fallback — they are silently corrupted instead.
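The two failure modes can be told apart with the codec alone, without mail-parser. The following is a minimal sketch that assumes only the Python standard library; the variable names are illustrative and do not come from the mail-parser source.

```python
# UTF-8 bytes for the em-dash body used in this report.
raw = "Hello — world\n".encode("utf-8")

# raw-unicode-escape treats every non-escape byte as Latin-1, so valid
# UTF-8 decodes without an exception and comes out as mojibake.
mojibake = raw.decode("raw-unicode-escape")
correct = raw.decode("utf-8")

# A malformed \u escape does raise, which is the case the existing
# try/except UnicodeDecodeError was written for.
try:
    b"C:\\users".decode("raw-unicode-escape")
    raised = False
except UnicodeDecodeError:
    raised = True

print(repr(mojibake))
print(repr(correct))
print(raised)
```

Because the UTF-8 input never raises, an except branch cannot be the place where it gets repaired.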
Steps to Reproduce
Minimal .eml (utf8.eml):
    From: a@example.com
    To: b@example.com
    Subject: probe
    Content-Type: text/plain; charset=utf-8
    MIME-Version: 1.0

    Hello — world
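Expected Behavior
Both parse_from_bytes and parse_from_string should produce 'Hello — world\n', honoring the part's declared charset=utf-8.
Root Cause
The faulty block is at src/mailparser/core.py:471-475 (4.2.1 / HEAD). The two code paths hand it very different payload bytes:
email.message_from_bytes(...).get_payload(decode=True) returns the raw body bytes (e.g. b'\xe2\x80\x94' for —). raw-unicode-escape decodes these as Latin-1 without raising, which produces mojibake.
email.message_from_string(...).get_payload(decode=True) returns ASCII-escaped bytes (e.g. b'\\u2014'). raw-unicode-escape decodes the \u2014 escape correctly, which produces the right character.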
So from_string works by accident: it relies on the stdlib's internal escape-encoding to produce input that raw-unicode-escape happens to handle. from_bytes exposes the underlying logic flaw.
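The asymmetry can be sketched with the standard library only. The bytes path below calls email.message_from_bytes directly. The string path is imitated by applying the raw-unicode-escape codec by hand, on the assumption that the stdlib behaves as described in this report for Python 3.11; other versions may differ. HEADERS, BODY, and the other names are illustrative.

```python
import email

HEADERS = (
    "From: a@example.com\n"
    "To: b@example.com\n"
    "Subject: probe\n"
    "Content-Type: text/plain; charset=utf-8\n"
    "MIME-Version: 1.0\n"
    "\n"
)
BODY = "Hello — world\n"

# Bytes path: the stdlib returns the body bytes exactly as they arrived.
msg = email.message_from_bytes((HEADERS + BODY).encode("utf-8"))
bytes_payload = msg.get_payload(decode=True)

# String path (imitated): non-ASCII text is escape-encoded to ASCII bytes.
escaped_payload = BODY.encode("raw-unicode-escape")

# The same decoder applied to both inputs gives different text.
from_bytes_text = bytes_payload.decode("raw-unicode-escape")
from_string_text = escaped_payload.decode("raw-unicode-escape")

print(bytes_payload, repr(from_bytes_text))
print(escaped_payload, repr(from_string_text))
```

Only the escaped input round-trips, which is why the string entry point appears to work.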
Why the Current Logic Is Inverted
The comment block above this code (lines 459-465) explains the intent: when get_payload(decode=True) returns bytes that Python "broke" by mishandling the encoding, the author reinterprets them via raw-unicode-escape. But the correct approach is the opposite: the part's Content-Type already declares the body's charset, so decode with that charset directly. raw-unicode-escape should never be the primary decoder for body bytes.
Suggested Fix
Decode with the declared charset directly:
    if not cte or cte in ["7bit", "8bit"]:
        try:
            payload = payload.decode(charset)
        except (UnicodeDecodeError, LookupError):
            # Fallback for legacy "\uXXXX-in-body" cases (#97); won't silently
            # mis-interpret UTF-8 because that bytes path is handled above.
            payload = payload.decode("raw-unicode-escape", errors="replace")
    else:
        payload = ported_string(payload, encoding=charset)
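This:
Honors Content-Type: charset=... for the case it was declared for.
Keeps the raw-unicode-escape path available for messages where #97-style \u literals appear in 8bit bodies labeled as ASCII/UTF-8.
Preserves behavior for non-7bit/8bit CTEs (base64, quoted-printable), which already go through ported_string(..., encoding=charset).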
A regression test loading the minimal .eml above via both parse_from_bytes and parse_from_string and asserting equality would catch any future re-inversion.
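The charset-first ordering can be exercised in isolation. The sketch below is a hypothetical stand-in for the 7bit/8bit/absent-CTE branch only: decode_text_payload is not a mail-parser function, and defaulting to UTF-8 when no charset is declared is an assumption made here for illustration.

```python
from typing import Optional


def decode_text_payload(payload: bytes, charset: Optional[str]) -> str:
    """Hypothetical helper: decode body bytes with the declared charset first."""
    try:
        # Assumption: fall back to UTF-8 when the part declares no charset.
        return payload.decode(charset or "utf-8")
    except (UnicodeDecodeError, LookupError):
        # Wrong or unknown charset: use the legacy decoder, but with
        # errors="replace" so it cannot raise.
        return payload.decode("raw-unicode-escape", errors="replace")


print(repr(decode_text_payload(b"Hello \xe2\x80\x94 world\n", "utf-8")))
print(repr(decode_text_payload(b"caf\xe9", "utf-8")))
print(repr(decode_text_payload(b"C:\\users", "utf-8")))
```

With this ordering, a body containing a literal \u sequence decodes as ordinary text under its declared charset, and the legacy decoder is reached only when that charset fails.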
Environment
mail-parser: 4.1.2 (also verified in 4.2.1 / HEAD)
Python: 3.11
OS: Linux
Workaround
In our subclass we override from_bytes to decode bytes via UTF-8 and delegate to from_string, which sidesteps the broken branch. Happy to open a PR with the upstream fix if there's interest.
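The shape of the workaround can be shown at the stdlib level. This is an analogue only, not the actual MailParser override: message_from_utf8_bytes is a hypothetical helper, and it assumes the whole message is valid UTF-8, which will not hold for messages with other charsets or binary 8bit parts.

```python
import email
from email.message import Message


def message_from_utf8_bytes(raw: bytes) -> Message:
    """Hypothetical helper: decode first, then reuse the string entry point."""
    return email.message_from_string(raw.decode("utf-8"))


raw = (
    "From: a@example.com\n"
    "To: b@example.com\n"
    "Subject: probe\n"
    "Content-Type: text/plain; charset=utf-8\n"
    "MIME-Version: 1.0\n"
    "\n"
    "Hello — world\n"
).encode("utf-8")

msg = message_from_utf8_bytes(raw)
print(repr(msg.get_payload()))
```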