parse_from_bytes silently mis-decodes 8bit UTF-8 bodies as Latin-1 via raw-unicode-escape #152

@ymyke

Summary

For messages parsed via parse_from_bytes (or MailParser.from_bytes), text body parts whose Content-Transfer-Encoding is 8bit, 7bit, or absent are mis-decoded: UTF-8 byte sequences are passed through raw-unicode-escape, which interprets them as Latin-1. The result is silent mojibake — no exception is raised, and the declared charset=utf-8 from Content-Type is ignored.

Example: — (em-dash, U+2014, UTF-8 \xe2\x80\x94) becomes the three-codepoint sequence â\x80\x94.

This is related to but not fixed by #97 / #125. PR #125 added a try/except UnicodeDecodeError fallback to ported_string(payload, encoding=charset), but the except branch only fires on malformed \u escapes (the original #97 crash). Valid UTF-8 byte sequences decode "successfully" under raw-unicode-escape and never trigger the fallback — they are silently corrupted instead.
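
The difference between the two failure modes can be checked with the standard library alone. The following is a minimal sketch, not mail-parser code; the helper name raises_decode_error and the sample Windows-style path are made up for illustration.

```python
# Stdlib-only illustration; nothing here calls mail-parser.
em_dash_utf8 = "\u2014".encode("utf-8")  # b'\xe2\x80\x94'

# Bytes that are not part of a backslash escape are mapped one-to-one,
# exactly as Latin-1 would map them, so no exception is raised.
mojibake = em_dash_utf8.decode("raw-unicode-escape")
latin1 = em_dash_utf8.decode("latin-1")
correct = em_dash_utf8.decode("utf-8")


def raises_decode_error(data: bytes) -> bool:
    """Hypothetical helper: True if raw-unicode-escape rejects the bytes."""
    try:
        data.decode("raw-unicode-escape")
    except UnicodeDecodeError:
        return True
    return False


print(repr(mojibake))                          # 'â\x80\x94'
print(mojibake == latin1)                      # True
print(raises_decode_error(em_dash_utf8))       # False: fallback never fires
print(raises_decode_error(b"C:\\users\\me"))   # True: malformed \u escape (#97 case)
```

Only the last input reaches the except branch, which is why the fallback from PR #125 does not help with valid UTF-8 bodies.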

Steps to Reproduce

Minimal .eml (utf8.eml):

From: a@example.com
To: b@example.com
Subject: probe
Content-Type: text/plain; charset=utf-8
MIME-Version: 1.0

Hello — world

Reproduction script:

import mailparser

m_bytes  = mailparser.parse_from_bytes(open("utf8.eml", "rb").read())
m_string = mailparser.parse_from_string(open("utf8.eml", encoding="utf-8").read())

print("from_bytes :", repr(m_bytes.text_plain[0]))
print("from_string:", repr(m_string.text_plain[0]))

Output:

from_bytes : 'Hello â\x80\x94 world\n'
from_string: 'Hello — world\n'

Expected Behavior

Both parse_from_bytes and parse_from_string should produce 'Hello — world\n', honoring the part's declared charset=utf-8.

Root Cause

src/mailparser/core.py:471-475 (4.2.1 / HEAD):

if not cte or cte in ["7bit", "8bit"]:
    try:
        payload = payload.decode("raw-unicode-escape")
    except UnicodeDecodeError:
        payload = ported_string(payload, encoding=charset)
else:
    payload = ported_string(payload, encoding=charset)

The two code paths give payload very different bytes before this block:

  • email.message_from_bytes(...).get_payload(decode=True) returns the raw body bytes (e.g. b'\xe2\x80\x94' for —). raw-unicode-escape decodes these as Latin-1 (no exception raised) → mojibake.
  • email.message_from_string(...).get_payload(decode=True) returns ASCII-escaped bytes (e.g. b'\\u2014'). raw-unicode-escape decodes the escape correctly → correct character.

So from_string works by accident: it relies on the stdlib's internal escape-encoding to produce input that raw-unicode-escape happens to handle. from_bytes exposes the underlying logic flaw.
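
The two payload shapes can be observed without mail-parser. This is a minimal sketch using only the stdlib email package with its default policy; the RAW constant simply reproduces the utf8.eml message from the reproduction steps, and the variable names are illustrative.

```python
import email

# Same message as utf8.eml, built in memory as UTF-8 bytes.
RAW = (
    "From: a@example.com\n"
    "To: b@example.com\n"
    "Subject: probe\n"
    "Content-Type: text/plain; charset=utf-8\n"
    "MIME-Version: 1.0\n"
    "\n"
    "Hello \u2014 world\n"
).encode("utf-8")

payload_from_bytes = email.message_from_bytes(RAW).get_payload(decode=True)
payload_from_string = email.message_from_string(RAW.decode("utf-8")).get_payload(decode=True)

print(payload_from_bytes)    # b'Hello \xe2\x80\x94 world\n'  (raw UTF-8 bytes)
print(payload_from_string)   # b'Hello \\u2014 world\n'       (ASCII-escaped)

# Applying raw-unicode-escape to each, as the current branch does:
print(repr(payload_from_bytes.decode("raw-unicode-escape")))   # mojibake
print(repr(payload_from_string.decode("raw-unicode-escape")))  # correct text
```

The same decoder call therefore yields different text depending only on which stdlib entry point produced the message object.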

Why the Current Logic Is Inverted

The comment block above this code (lines 459-465) explains the intent: when get_payload(decode=True) returns bytes that Python "broke" by mishandling the encoding, the author reinterprets them via raw-unicode-escape. But the correct approach is the opposite: the part's Content-Type already declares the body's charset, so decode with that charset directly. raw-unicode-escape should never be the primary decoder for body bytes.

Suggested Fix

Decode with the declared charset directly:

if not cte or cte in ["7bit", "8bit"]:
    try:
        payload = payload.decode(charset)
    except (UnicodeDecodeError, LookupError):
        # Fallback for legacy "\uXXXX-in-body" cases (#97). Bytes that are valid
        # in the declared charset never reach this branch, so they are not mis-decoded.
        payload = payload.decode("raw-unicode-escape", errors="replace")
else:
    payload = ported_string(payload, encoding=charset)

This:

  • Honors Content-Type: charset=... for the case it was declared for.
  • Keeps the raw-unicode-escape path available for #97-style messages (UnicodeDecodeError when parsing email with "\u" in its body), where \u literals appear in 8bit bodies labeled as ASCII/UTF-8.
  • Preserves behavior for non-7bit/8bit CTEs (base64, quoted-printable) which already go through ported_string(..., encoding=charset).
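
To see how the charset-first order behaves on a few inputs, here is a standalone sketch of the same logic. The function decode_text_payload is hypothetical and exists only for this illustration; it is not part of mail-parser, and it assumes charset is always a string.

```python
def decode_text_payload(payload: bytes, charset: str = "utf-8") -> str:
    """Hypothetical stand-in for the 7bit/8bit branch of the suggested fix."""
    try:
        # Primary path: trust the charset declared in Content-Type.
        return payload.decode(charset)
    except (UnicodeDecodeError, LookupError):
        # Fallback: invalid bytes or an unknown charset name.
        return payload.decode("raw-unicode-escape", errors="replace")


utf8_body = decode_text_payload(b"Hello \xe2\x80\x94 world\n", "utf-8")
bad_bytes = decode_text_payload(b"caf\xe9", "utf-8")            # not valid UTF-8
bad_charset = decode_text_payload(b"caf\xe9", "x-unknown")      # LookupError
escaped_body = decode_text_payload(b"Hello \\u2014 world\n", "utf-8")

print(repr(utf8_body))
print(repr(bad_bytes))      # 'café' via the fallback
print(repr(bad_charset))    # 'café' via the fallback
print(repr(escaped_body))   # 'Hello \\u2014 world\n': the escape stays literal
```

One thing the last case suggests: ASCII-escaped bytes, such as those the message_from_string path produces, are valid in the declared charset and so stay as a literal \u2014 in this sketch. Whether the real code path needs extra handling for that input is an open question this sketch cannot answer.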

A regression test loading the minimal .eml above via both parse_from_bytes and parse_from_string and asserting equality would catch any future re-inversion.

Environment

  • mail-parser: 4.1.2 (also verified in 4.2.1 / HEAD)
  • Python: 3.11
  • OS: Linux

Workaround

In our subclass we override from_bytes to decode bytes via UTF-8 and delegate to from_string, which sidesteps the broken branch. Happy to open a PR with the upstream fix if there's interest.
